Assignment Date: Friday, Sept. 24, 2021
Due Date: Friday, Oct. 1, 2021 @ 1:00pm ET
Slides are available here: 20210924_qblab_gwas.pptx
One of the central goals of human genetics is clarifying the relationship between genotypes and phenotypes. Genome-wide association studies (GWAS) emerged ~20 years ago as a useful approach for discovering genetic variation that underlies variation in human traits. At their core, GWAS involve fitting linear models to test for relationships between polymorphisms (often SNPs) and phenotypes using data from large samples of individuals.
Genotypes of nearby SNPs are correlated–a phenomenon termed “linkage disequilibrium” or “LD”. This is both a blessing and a curse for GWAS. On one hand, LD means that we need not genotype every SNP to discover associations. We merely need to genotype “tag SNPs” that segregate in LD with variants that causally influence the phenotype. On the other hand, this also means that even when we find a signficant association, it is often challenging to disentangle the causal gene and/or variant that drives the association.
Today we will conduct a GWAS using a materials adapted from workshop by Heather Wheeler from Loyola University Chicago (https://github.com/hwheeler01/GWAS_workshop). Note that these phenotype data are simulated. The combination of genotype and phenotype data poses a privacy risk, so real genotype and phenotype data from humans are often stored in controlled-access databases such as dbGaP.
The premise for the exercise is that you are part of a company developing two new cancer drugs called GS451 and CB1908. Some individuals in early phase trials have experienced a side effect of these drugs called lymphocytopenia (low lymphocyte counts). You are now tasked with performing a GWAS on data from lymphblastoid cell lines to search for risk factors for lymphocytopenia. The phenotype that was measured is the IC50, defined as the concentration of the drug at which 50% viability occurs. Phenotype data are stored in the CB1908_IC50.txt
and GS451_IC50.txt
files. Genotype data are stored in a VCF file (genotypes.vcf
)—a format that you should recognize from last class.
Create a new week3
directory in your qbb2021-answers
directory. In this directory, download the three files you’ll need for this assignment:
wget https://github.com/bxlab/qbb2021/raw/main/week3/genotypes.vcf
wget https://github.com/bxlab/qbb2021/raw/main/week3/GS451_IC50.txt
wget https://github.com/bxlab/qbb2021/raw/main/week3/CB1908_IC50.txt
You shouldn’t be pushing any of these files to your remote repo, so update your .gitignore
file as necessary.
plink
, perform quantitative association testing for each phenotype. Use the top 10 principal components (eigenvectors) as covariates in your analysis, to adjust for non-independence due to relatedness.
--allow-no-sex
optionplink
documentation helpful for performing association testing on each of the phenotypesplink
can perform both as well. Either way, you will probably want to use matplotlib to produce the plots.plink
expects both a family ID and a sample ID. In your VCF, they are separated by an underscore, but in your phenotype file they are separated by a tab-character. The files we provided are already formatted in this way, but keep this in mind for future work where reformatting may be necessary.Submit to your qbb2021-answers
repository:
DO NOT push any raw data to Github unless we explicitly ask for it!
For the top loci associated with each of the phenotypes, use the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway) to investigate the potential causal genes in the region. Note that the reference genome build being used here is hg18. (How could you figure this out if you didn’t know?)
Examine the frequency of the top associated variants. What is its frequency in various global populations? Use the GGV Browser to explore this: https://popgen.uchicago.edu/ggv
Is the variant itself necessarily causal in driving the association? Investigate the local haplotype structure using the LDLink website’s “LDproxy” tool: https://ldlink.nci.nih.gov/?tab=ldproxy