Assignment 3: Genome-Wide Association Studies

Assignment Date: Friday, Sept. 24, 2021
Due Date: Friday, Oct. 1, 2021 @ 1:00pm ET

Lecture

Slides are available here: 20210924_qblab_gwas.pptx

One of the central goals of human genetics is clarifying the relationship between genotypes and phenotypes. Genome-wide association studies (GWAS) emerged ~20 years ago as a useful approach for discovering genetic variation that underlies variation in human traits. At their core, GWAS involve fitting linear models to test for relationships between polymorphisms (often SNPs) and phenotypes using data from large samples of individuals.

Genotypes of nearby SNPs are correlated–a phenomenon termed “linkage disequilibrium” or “LD”. This is both a blessing and a curse for GWAS. On one hand, LD means that we need not genotype every SNP to discover associations. We merely need to genotype “tag SNPs” that segregate in LD with variants that causally influence the phenotype. On the other hand, this also means that even when we find a signficant association, it is often challenging to disentangle the causal gene and/or variant that drives the association.

Background

Today we will conduct a GWAS using a materials adapted from workshop by Heather Wheeler from Loyola University Chicago (https://github.com/hwheeler01/GWAS_workshop). Note that these phenotype data are simulated. The combination of genotype and phenotype data poses a privacy risk, so real genotype and phenotype data from humans are often stored in controlled-access databases such as dbGaP.

The premise for the exercise is that you are part of a company developing two new cancer drugs called GS451 and CB1908. Some individuals in early phase trials have experienced a side effect of these drugs called lymphocytopenia (low lymphocyte counts). You are now tasked with performing a GWAS on data from lymphblastoid cell lines to search for risk factors for lymphocytopenia. The phenotype that was measured is the IC50, defined as the concentration of the drug at which 50% viability occurs. Phenotype data are stored in the CB1908_IC50.txt and GS451_IC50.txt files. Genotype data are stored in a VCF file (genotypes.vcf)—a format that you should recognize from last class.

Downloading Data

Create a new week3 directory in your qbb2021-answers directory. In this directory, download the three files you’ll need for this assignment:

wget https://github.com/bxlab/qbb2021/raw/main/week3/genotypes.vcf
wget https://github.com/bxlab/qbb2021/raw/main/week3/GS451_IC50.txt
wget https://github.com/bxlab/qbb2021/raw/main/week3/CB1908_IC50.txt

You shouldn’t be pushing any of these files to your remote repo, so update your .gitignore file as necessary.

Assignment

  1. Visualize genetic relatedness between the strains by performing principal component analysis and plotting the first two components.
  2. Visualize the allele frequency spectrum by plotting a histogram of allele frequencies.
  3. Using plink, perform quantitative association testing for each phenotype. Use the top 10 principal components (eigenvectors) as covariates in your analysis, to adjust for non-independence due to relatedness.
    • Be sure to use the --allow-no-sex option
    • You may find this portion of the plink documentation helpful for performing association testing on each of the phenotypes
  4. For each phenotype, produce a QQ plot and Manhattan plot. For the Manhattan plot, highlight SNPs with p-values less than 10-5 in a different color.
  5. Choose one of the traits for which you performed GWAS. For the top associated SNP, visualize the effect size by creating a boxplot of the phenotype stratified by genotype.

Hints

  • You can perform the PCA and allele frequency calculations in Python, but plink can perform both as well. Either way, you will probably want to use matplotlib to produce the plots.
  • For PCA, be sure you have one datapoint per individual, as opposed to per SNP.
  • Each SNP only has one allele frequency.
  • Notice how the IDs are encoded in the phenotype files phenotypes. plink expects both a family ID and a sample ID. In your VCF, they are separated by an underscore, but in your phenotype file they are separated by a tab-character. The files we provided are already formatted in this way, but keep this in mind for future work where reformatting may be necessary.

Submit

Submit to your qbb2021-answers repository:

  • All scripts
  • A record of your command-line commands, if applicable
  • All plots

DO NOT push any raw data to Github unless we explicitly ask for it!

Advanced Exercises

  1. For the top loci associated with each of the phenotypes, use the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway) to investigate the potential causal genes in the region. Note that the reference genome build being used here is hg18. (How could you figure this out if you didn’t know?)

  2. Examine the frequency of the top associated variants. What is its frequency in various global populations? Use the GGV Browser to explore this: https://popgen.uchicago.edu/ggv

  3. Is the variant itself necessarily causal in driving the association? Investigate the local haplotype structure using the LDLink website’s “LDproxy” tool: https://ldlink.nci.nih.gov/?tab=ldproxy