One of the central goals of human genetics is clarifying the relationship between genotypes and phenotypes. Genome-wide association studies (GWAS) emerged around 20 years ago as a useful approach for discovering genetic variation that underlies variation in human traits. At their core, GWAS involve fitting linear models to test for relationships between polymorphisms (often SNPs) and phenotypes using data from large samples of individuals.
Genotypes of nearby SNPs are correlated–a phenomenon termed “linkage disequilibrium” or “LD”. This is both a blessing and a curse for GWAS. On one hand, LD means that we need not genotype every single SNP to discover associations. We merely need to genotype “tag SNPs” that segregate in LD with variants that causally influence the phenotype. On the other hand, this also means that even when we find a signficant association, it is often challenging to disentangle the causal gene and/or variant that drives the association.
Today we will conduct a GWAS using a materials adapted from workshop by Heather Wheeler from Loyola University Chicago (https://github.com/hwheeler01/GWAS_workshop). Note that these phenotype data are simulated. The combination of genotype and phenotype data poses a privacy risk, so real genotype and phenotype data from humans are often stored in controlled-access databases such as dbGaP.
The premise for the exercise is that you are part of a company developing two new cancer drugs called GS451 and CB1908. Some individuals in early phase trials have experienced a side effect of these drugs called lymphocytopenia (low lymphocyte counts). You are now tasked with performing a GWAS on data from lymphblastoid cell lines to search for risk factors for lymphocytopenia. The phenotype that was measured is the IC50, defined as the concentration of the drug at which 50% viability occurs.
All the data you need for this assignment is in a zipped folder here: ~/cmdb-quantbio/assignments/lab/GWAS/extra_data/gwas_data.tar.gz
. Make a working copy of this file within your weekly homework directory.
After copying the zipped folder, you’ll need to extract it with tar -zxvf <filename.tar.gz>
. You should get 3 files:
The phenotype data (i.e., the IC50 for each drug) are stored in the CB1908_IC50.txt
and GS451_IC50.txt
files. Genotype data are stored in a VCF file (genotypes.vcf
)—a format that you should recognize by now.
As always, if your .gitignore
is not already set up to ignore these files (the VCF and the two phenotype files), you should update it so that they are ignored. You should NOT be uploading any of these files.
Before you begin the exercises, create a plotting.py
script in your submission directory. You’ll be using this script to visualize 1) interesting summaries of your data and 2) your GWAS results.
Also, create a README.md
in your submission directory. You’ll be recording the commands you used for this assignment in this file, and submitting it with your plots.
First, you want to perform PCA to compute the top genotype PCs for the samples in your data set. This is useful not only for visualizing the population structure in your data set, but also allows you to control for this structure when you perform your GWAS later.
Using plink
, perform PCA on the genotypes of the cell lines in the data set. Output the top 10 principal components.
Record the plink
command(s) you used in your README.md
.
Now that you’ve computed the top genotyping PCs, you can visualize the population structure in your data set.
In your plotting.py
script, plot the top 2 genotype principal components from plink
to visualize the genetic relatedness between the cell lines in the data set. Ensure that your plot is nicely formatted and properly labeled.
The shape of the allele frequency specturm (AFS) can tell you interesting things about your sample, such as demographic history, and evidence of selection (read more here. While you will not be running these specific tests today, it is still interesting (and good practice) to visualize the AFS.
Usually, the AF of each variant is stored in the INFO
field in your VCF. The vcf file we’re using does not have an INFO
field, so you’ll need to calculate AF yourself.
Using plink
, calculate allele frequencies for the variants in your VCF and output these to a file.
Record the plink
command(s) you used in your README.md
.
Now that you’ve computed allele frequencies, you can visualize the AFS of the samples in your data set.
In your plotting.py
script, plot the AFS of the samples in your data set as a histogram. Ensure that your plot is nicely formatted and properly labeled.
Now you’re ready to run your GWAS!
Using plink
, perform quantitative association testing for each of your two phenotypes (you will be running plink
one each per phenotype). Use the top 10 principal components you computed in Step 1.1 as covariates in your analysis to control for non-independence due to relatedness/population structure.
<li> Be sure to use the --allow-no-sex
option</li>
<li> You may find this portion of the plink
documentation helpful for performing association testing on each of the phenotypes.</li>
<li> HINT: We do NOT have case/control phenotpe data, so --assoc
is not the correct plink
flag.
We encourage you to try this on your own first, but you can use the hint below to check your answer (this WILL give you the full plink
command, be warned).
plink --vcf genotypes.vcf --linear --pheno phenotype.txt --covar pca.eigenvec --allow-no-sex --out phenotype_gwas_results
You should have two output files, one for each phenotype.
Record the plink
commands you used in your README.md
.
Now that you’ve run your GWAS analyses, you want to visualize the results.
In your plotting.py
script, produce a two-panel figure (one column, two rows) depicting the Manhattan plot for each of your two GWAS analyses (i.e. each phenotype). Each Manhattan plot should be one of the two panels in your figure. In each Manhattan plot, highlight SNPs with p-values less than 10-5 in a different color.
Ensure that your figure/panels are nicely formatted and properly labeled.
You want to dig deeper into the top GWAS hits from your analyses.
Choose one of the traits for which you performed GWAS.
In your plotting.py
script: for the top associated SNP (lowest p-value) of that trait, plot the effect size of that variant on the chosen trait by creating a boxplot of the phenotype stratified by genotype.
For the top loci associated with each of your two phenotypes, use the UCSC Genome Browser to investigate the potential causal genes in the region. Note that the reference genome build being used here is hg18. (How could you figure this out if you didn’t know?).
In your README.md
, summarize what you found (i.e. what gene(s) are closest to each top hit? How might these genes impact the trait?).
For this assignment you should submit the following:
README.md
with commands and analyses (2 points)
plotting.py
script to produce plots (4 points)
DO NOT push any raw data! Only the things we asked for!
In addition to your Manhattan plot, produce a QQ plot for each of the two traits, summarizing the results.
Examine the frequency of the top associated variants. What is its frequency in various global populations? Use the GGV Browser to explore this: https://popgen.uchicago.edu/ggv
Is the variant itself necessarily causal in driving the association? Investigate the local haplotype structure using the LDLink website’s “LDproxy” tool: https://ldlink.nci.nih.gov/?tab=ldproxy