SNP level filtering (Part II) - Hardy-Weinberg Equilibrium

(This lab was adapted for statsTeachR by Tu Dao, Siying Chen, Nicholas Reich and Andrea Foulkes from our GWAS tutorial.)

After filtering the sample, we proceed to filter the SNPs with Hardy Weinberg Equilibrium (HWE). HWE states that probability of an allele occuring on one homolog does not depend on which allele is present on the other homolog in a large, randomly mating population that remain stable over generations. Violation of HWE can be an indication of the presence of population substructure or occurence of a genotyping error. While it is not always true, it is common to remove any SNPs that violated HWE to avoid genotyping error. Departures of HWE at a given SNPs are usually measured with a chi-square goodnness-of-fit test between the observed and expected genotypes.

In this lab, we filter our SNPs based on HWE p < 1*10^-6 in CAD controls. It results in 7643 SNPs remain for the association analysis.

## The downloaded binary packages are in
##  /var/folders/_9/8f93yrbn70v__gsn6550wbqr0000gn/T//Rtmp2scphX/downloaded_packages
## The downloaded binary packages are in
##  /var/folders/_9/8f93yrbn70v__gsn6550wbqr0000gn/T//Rtmp2scphX/downloaded_packages
hardy <- 10^-6     
CADcontrols <- as.character(clinical[ clinical$CAD==0, 'FamID' ])
snpsum.colCont <- col.summary(genotype[CADcontrols,])
HWEuse <- with(snpsum.colCont, ! & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) )
HWEuse[] <- FALSE          
cat(ncol(genotype)-sum(HWEuse),"SNPs will be removed due to high HWE.\n") 
## 765 SNPs will be removed due to high HWE.
genotype <- genotype[,HWEuse]
## A SnpMatrix with  1400 rows and  381257 columns
## Row names:  10002 ... 11596 
## Col names:  rs4579145 ... rs946221


Why do we have to filter the sample before filter the SNPs with HWE? How many SNPs have a HWE p-value of 10^-5 or less?

Saving work for following labs…

You can save the necessary data from this lab by running the following commands:

save(genotype, genoBim, clinical, file= "lab4_save.RData")

On your own

  1. In the example above, why do we only use CAD controls when filtering?

  2. In the example above,what are the 5 five SNPs with highest HWE?

  3. Write out the form of the chi-square goodness-of-fit test statistic. Be sure to define each term.

  4. Load the exercise data that was saved in the previous lab using the command load(“exercise_dat.RData”). Filter on HWE using the exercise data.

  5. Recall that this is the data before sample level filtering was performed. What does this imply about the results?