Sample level filtering

(This lab was adapt for statsTeachR by Tu Dao,Siying Chen, Nicholas Reich and Andrea Foulkes from our GWAS tutorial.)

For better result, we need to filter our sample. The second stage of data pre-processing involves filtering samples, i.e. removing individuals due to missing data, sample contamination, correlation (for population-based investigations), and racial/ethnic or gender ambiguity or discordance. In our study, we address these issues by filtering on call rate, heterozygosity, cryptic relatednes and duplicates using identity-by-descent, and we visually assess ancestry.

Getting started

To continue the data pre-processing stage, we need to load the data saved from lab 2 using the load() command. Sample level quality control for missing data and heterozygosity is achieved using the row.summary function from snpStats.

load("lab2_save.RData")
library(snpStats)
source("http://bioconductor.org/biocLite.R")
biocLite("SNPRelate")
## 
## The downloaded binary packages are in
##  /var/folders/_9/8f93yrbn70v__gsn6550wbqr0000gn/T//Rtmp8kW974/downloaded_packages
library(SNPRelate)                     
library(plyr)
snpsum.row <- row.summary(genotype)

Basic sample level filtering

Heterozygosity

Heterozygosity refers to the presence of each of two alleles at a given SNP within an individual. This is expected under Hardy-Weinberg Equilibrium (HWE) to occur with probability of 2p(1-p) where p is the dominant allele frequency at that SNP (assuming a bi-allelic SNP). Excess or deficient heterozygosity across typed SNPs within an individual is commonly considered as an indication of inbreeding between relatives, but also serves as a screen for sample contamination or poor sample quality.

In this lab, heterozygosity F statistic calculation is carried out with the form, |F|=(1-O/E), where O is observed proportion of heterozygous genotypes for a given sample and E is the expected proportion of heterozygous genotypes for a given sample based on the minor allele frequency accross all non-missing SNPs for a given sample. We remove any sample with inbreeding coefficient |F| > 0.10, which might indicates interbreeding or bad quality sample. We also remove any subjects with NA values.

MAF <- snpsum.col$MAF
callmatrix <- !is.na(genotype)
hetExp <- callmatrix %*% (2*MAF*(1-MAF))
hetObs <- with(snpsum.row, Heterozygosity*(ncol(genotype))*Call.rate)
snpsum.row$hetF <- 1-(hetObs/hetExp)
head(snpsum.row)
##       Call.rate Certain.calls Heterozygosity         hetF
## 10002 0.9826528             1      0.3281344 -0.022019308
## 10004 0.9891708             1      0.3231777 -0.006829093
## 10005 0.9918277             1      0.3239852 -0.008817640
## 10007 0.9860113             1      0.3241203 -0.009882026
## 10008 0.9824172             1      0.3231505 -0.008550116
## 10009 0.9912570             1      0.3205117  0.002331055
hetcutoff <- 0.1   
sampleuse <- with(snpsum.row, abs(hetF) <= hetcutoff)
sampleuse[is.na(sampleuse)] <- FALSE  
cat(nrow(genotype)-sum(sampleuse), "subjects will be removed due to low sample call rate or inbreeding coefficient.\n")
## 0 subjects will be removed due to low sample call rate or inbreeding coefficient.

After idenitfying the subjects to be removed, we subset the genotype and clinical data set using information obtained above to remove the subjects that do not pass the criteria from our sample.

genotype <- genotype[sampleuse,]
clinical<- clinical[rownames(genotype), ]
Exercise 1

Based on our code above for calculation of F-statistics, given that

print(MAF[13])
## [1] 0.02607143

Calculate heterozygosity of this subject on a separate piece of paper.

Call rate

To filter by call rate, we remove individuals that are missing genotype data across more than a pre-defined percentage of the typed SNPs. This proportion of missingness across SNPs is the sample call rate and we apply a threshold of 5%, meaning that we drop those individuals missing genotype data for more than 5% of the typed SNPs. A new reduced dimesion SnpMatrix gentotype is created from this filter.

sampcall <- 0.95   
sampleuse1 <- with(snpsum.row, !is.na(Call.rate) & Call.rate > sampcall)
sampleuse1[is.na(sampleuse1)] <- FALSE   
cat(nrow(genotype)-sum(sampleuse1), "subjects will be removed due to low sample call rate or inbreeding coefficient.\n")
## 0 subjects will be removed due to low sample call rate or inbreeding coefficient.
genotype <- genotype[sampleuse1,]
clinical<- clinical[ rownames(genotype), ]

Exercise 2

What’s the difference between SNP-level and sample-level filtering for call rates?

Cryptic relateness, duplicates and gender identity

Population-based cohort studies often only concerns unrelated individuals and the generalized linear modeling approach assumes independence across individuals. In regional cohort studies (e.g. hospital-based cohort studies) of complex diseases, however, individuals from the same family could be unintentionally recruited. To mitigate this problem, we thus employ Identity-by-descent (IBD) analysis which is a common measurement of relateness and duplication. Note that gender identity can also be checked at this stage to confirm self-reported gender consistency on X and Y chromosomes. With this data, we do not check gender identity due to unavailibility of sex chromosomes information.

To filter on relateness criteria, we use the SNPRelate package to perform identity-by-descent (IBD) analysis. This package requires that the data be transformed into a GDS format file. IBD analysis is performed on only a subset of SNPs that are in linkage equilibrium by iteratively removing adjacent SNPs that exceed an linkage disequilibrium (LD) threshold in a sliding window using the snpgdsLDpruning() function.In this lab, we set LD threshold to be 20% and kinship threshold to be 10%, which means that we would remove any SNPs whose relevant statistics are higher than our stated threshold values. Then we used snpgdsBED2GDS to create a GDS file of the data set from .bim, .bed and .fam files.

ld.thresh <- 0.2    
kin.thresh <- 0.1   
snpgdsBED2GDS("GWAS_data.bed", "GWAS_data.fam", "GWAS_data.bim","GWAS_data.gds")
## Start snpgdsBED2GDS ...
##  BED file: "GWAS_data.bed" in the SNP-major mode (Sample X SNP)
##  FAM file: "GWAS_data.fam", DONE.
##  BIM file: "GWAS_data.bim", DONE.
## Thu Aug 27 11:58:23 2015     store sample id, snp id, position, and chromosome.
##  start writing: 1401 samples, 500000 SNPs ...
##      Thu Aug 27 11:58:23 2015    0%
##      Thu Aug 27 11:58:32 2015    100%
## Thu Aug 27 11:58:33 2015     Done.
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##  open the file "GWAS_data.gds" (size: 179677809).
##  # of fragments in total: 39.
##  save it to "GWAS_data.gds.tmp".
##  rename "GWAS_data.gds.tmp" (size: 179677557).
##  # of fragments in total: 18.
genofile <- openfn.gds("GWAS_data.gds", readonly = FALSE)
gds.ids <- read.gdsn(index.gdsn(genofile, "sample.id"))
gds.ids <- sub("-1", "", gds.ids)
add.gdsn(genofile, "sample.id", gds.ids, replace = TRUE)

After creating and removing samples with automatically added suffixes, we have to prune the SNPs based on linkage disequilibrium for IBD analysis.

set.seed(1000)
geno.sample.ids <- rownames(genotype)
snpSUB <- snpgdsLDpruning(genofile, ld.threshold = ld.thresh,
                          sample.id = geno.sample.ids, 
                          snp.id = colnames(genotype)) 
## SNP pruning based on LD:
## Excluding 117978 SNPs on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1401 samples, 382022 SNPs
##  Using 1 (CPU) core
##  Sliding window: 500000 basepairs, Inf SNPs
##  |LD| threshold: 0.2
## Chromosome 4: 75.65%, 24341/32174
## Chromosome 6: 77.35%, 24251/31353
## Chromosome 9: 77.01%, 18363/23846
## Chromosome 18: 75.73%, 11477/15156
## Chromosome 3: 76.32%, 26881/35222
## Chromosome 17: 75.83%, 8783/11583
## Chromosome 19: 76.03%, 5071/6670
## Chromosome 11: 76.12%, 19474/25583
## Chromosome 10: 76.53%, 21411/27978
## Chromosome 1: 75.08%, 31002/41294
## Chromosome 13: 76.03%, 15115/19880
## Chromosome 8: 76.89%, 21622/28122
## Chromosome 12: 76.43%, 18631/24377
## Chromosome 22: 74.48%, 4875/6545
## Chromosome 2: 75.94%, 32479/42770
## Chromosome 16: 75.62%, 12164/16085
## Chromosome 5: 77.26%, 25166/32572
## Chromosome 7: 77.56%, 20985/27058
## Chromosome 15: 76.05%, 11429/15028
## Chromosome 14: 76.46%, 12447/16280
## Chromosome 20: 76.10%, 10076/13241
## Chromosome 21: 76.75%, 5513/7183
## 381556 SNPs are selected in total.
snpset.ibd <- unlist(snpSUB, use.names=FALSE)
cat(length(snpset.ibd),"will be used in IBD analysis\n")
## 381556 will be used in IBD analysis

When all the SNPs are pruned, we need to find the IBD coefficients using Method of Moments procedure, including pairwise kinship

ibd <- snpgdsIBDMoM(genofile, kinship=TRUE,
                    sample.id = geno.sample.ids,
                    snp.id = snpset.ibd,
                    num.thread = 1)
## IBD analysis (PLINK method of moment) on SNP genotypes:
## Excluding 118444 SNPs on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1401 samples, 381556 SNPs
##  Using 1 (CPU) core
## PLINK IBD:   the sum of all working genotypes (0, 1 and 2) = 251556755
## PLINK IBD:   Thu Aug 27 11:58:45 2015    0%
## PLINK IBD:   Thu Aug 27 11:59:19 2015    18%
## PLINK IBD:   Thu Aug 27 11:59:52 2015    36%
## PLINK IBD:   Thu Aug 27 12:00:25 2015    54%
## PLINK IBD:   Thu Aug 27 12:00:59 2015    72%
## PLINK IBD:   Thu Aug 27 12:01:31 2015    91%
## PLINK IBD:   Thu Aug 27 12:01:47 2015    100%
ibdcoeff <- snpgdsIBDSelection(ibd)    
head(ibdcoeff)
##     ID1   ID2        k0         k1     kinship
## 1 10002 10004 0.9651755 0.03482454 0.008706134
## 2 10002 10005 0.9694550 0.03054498 0.007636244
## 3 10002 10007 0.9327868 0.06721323 0.016803307
## 4 10002 10008 0.9218918 0.07810819 0.019527049
## 5 10002 10009 0.9666561 0.03334394 0.008335986
## 6 10002 10010 0.9715478 0.02845221 0.007113054

Now we have IBD coefficient, we can check to see if there are any candidates for relateness using kinship threshold. We remove any samples with high kinship starting with the sample with the most pairings.

ibdcoeff <- ibdcoeff[ ibdcoeff$kinship >= kin.thresh, ]
related.samples <- NULL
while ( nrow(ibdcoeff) > 0 ) {
    sample.counts <- arrange(count(c(ibdcoeff$ID1, ibdcoeff$ID2)), -freq)
    rm.sample <- sample.counts[1, 'x']
    cat("Removing sample", as.character(rm.sample), 'too closely related to', 
        sample.counts[1, 'freq'],'other samples.\n')

    ibdcoeff <- ibdcoeff[ibdcoeff$ID1 != rm.sample & ibdcoeff$ID2 != rm.sample,]
    related.samples <- c(as.character(rm.sample), related.samples)
}
## Removing sample 10347 too closely related to 1 other samples.

Having removed the related samples, we proceed to filter the genotype and clinical data to include only unrelated samples. After filtering, only 1396 subjects remain.

genotype <- genotype[ !(rownames(genotype) %in% related.samples), ]
clinical <- clinical[ !(clinical$FamID %in% related.samples), ]

geno.sample.ids <- rownames(genotype)

cat(length(related.samples), "similar samples removed due to correlation coefficient >=", 
    kin.thresh,"\n") 
## 1 similar samples removed due to correlation coefficient >= 0.1
print(genotype)        
## A SnpMatrix with  1400 rows and  382022 columns
## Row names:  10002 ... 11596 
## Col names:  rs4579145 ... rs946221

Exercise 3

Why do we need to prune the SNPs data for linkage equilibrium before IBD analysis?

Ancestry

Principle components (PCs) from multi-dimensional scaling (MDS) is one approach to visualizing and classifying individuals into ancestry groups based on their observed genetics make-up. There are two reasons for doing this. First, self-reported race/ethnicity can differ from clusters of individuals that are based on genetics infomation. Second, the presence of an individual not appearing to fall within a racial/ethnic cluster may be suggestive of a sample error.

Exercise 4

Without looking at the plot, can you predict how many sample we going to remove based on information on the data set in lab 1? Explain.

To better understand ancestry, we plot the first two principal components of the genotype data. Principal component calculation is achieved via the snpgdsPCA function from SNPRelate. Then the data frame of first two principle components is created. Then we plot the first two principle components to look for trend and outliers. To decide whether or not to remove any samples, we have to look for any obvious outliers on the ancestry plot.

pca <- snpgdsPCA(genofile, sample.id = geno.sample.ids,  snp.id = snpset.ibd, num.thread=1)
## Principal Component Analysis (PCA) on SNP genotypes:
## Excluding 118444 SNPs on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1400 samples, 381556 SNPs
##  Using 1 (CPU) core
## PCA: the sum of all working genotypes (0, 1 and 2) = 251378768
## PCA: Thu Aug 27 12:02:05 2015    0%
## PCA: Thu Aug 27 12:02:35 2015    16%
## PCA: Thu Aug 27 12:03:05 2015    32%
## PCA: Thu Aug 27 12:03:35 2015    48%
## PCA: Thu Aug 27 12:04:05 2015    64%
## PCA: Thu Aug 27 12:04:35 2015    80%
## PCA: Thu Aug 27 12:05:05 2015    96%
## PCA: Thu Aug 27 12:05:11 2015    100%
## PCA: Thu Aug 27 12:05:11 2015    Begin (eigenvalues and eigenvectors)
## PCA: Thu Aug 27 12:05:12 2015    End (eigenvalues and eigenvectors)
pctab <- data.frame(sample.id = pca$sample.id,
                    PC1 = pca$eigenvect[,1],    
                    PC2 = pca$eigenvect[,2],    
                    stringsAsFactors = FALSE)
plot(pctab$PC2, pctab$PC1, xlab="Principal Component 2", ylab="Principal Component 1", 
     main = "Ancestry Plot")

Because we know that the population used in the PennCath data was homogenous from European ancestry, we can conclude that there are no obvious outliers.

Saving work for following labs…

Before saving the data, we have to close the GDS file using closefn.gds() command. You can save the necessary data from this lab by running the following commands:

closefn.gds(genofile)
save(genotype, snpsum.col, snpsum.row, genofile, genoBim, clinical, file= "lab3_save.RData")

On Your Own

In the following exercises, we will utilize a different subset of the PennCath data described in the first lab. Reading in the data from the web is similar here to when we read the original subset in the first lab. We will also save this exercise data for future use.

library(downloader)
bedFile<- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/GWAS_ex.bed",
destfile = "data1.bed")
bimFile<- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/GWAS_ex.bim",
destfile = "data1.bim")
famFile<- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/GWAS_ex.fam",
destfile = "data1.fam")

exercise_dat <- read.plink("data1.bed","data1.bim","data1.fam",na.strings=("-9"))
geno1 <- exercise_dat$genotypes
fam1 <- exercise_dat$fam
bim1 <- exercise_dat$map
save(exercise_dat,file="exercise_dat.RData")

Use this new subset of data from the PenCath cohort, complete the following sample level filtering procedures 1. Please report at each step how many SNPs are removed/remained (if applicable) and answer the additional questions : a) Filter by hetereozigosity. b) Filter by call rates. c) Filter by relatedness. How many SNPs were involved in the IBD analysis? d) Is there any indication of population substructure in this data? Please explain with the appropriate plot and state your reasoning.

  1. Use your results from the first problem. Answer the following questions:
  1. In your IBD analysis, what’s the SNP that has the largest number of pairings? How do you know?(If there are multiple, then only name one of them)
  2. When checking for the appropriate heterozigosity, what’s the top 3 largest inbreeding coefficients?