Today you will perform alignment and variant calling across multiple haploid yeast strains. These strains are the progeny of a cross between a lab strain and a wine strain. The data come from this paper: “Finding the sources of missing heritability in a yeast cross”. After performing variant calling, you’ll be doing some basic exploratory analysis of the genetic variation you discovered.
As part of this assignment, you’ll be working with ten sets of single-end Illumina sequencing reads, each for a different yeast strain. These are in a tarball on Dropbox. You can download the data into your current directory with the following commands:
wget https://www.dropbox.com/s/ogx7nbhlhrp3ul6/BYxRM.tar.gz
tar -xvzf BYxRM.tar.gz
You will be aligning reads from your yeast samples to the Saccharomyces cerevisiae reference genome. This reference is called sacCer3 by the UCSC genome browser, but its name in the NCBI Assembly archive is R64-1-1. You’ll need this info later for snpEff.
We have provided a single whole genome reference file for you:
/Users/cmdb/data/genomes/sacCer3.fa
Make a working copy of this file within your weekly homework directory.
As always, if your .gitignore
is not already set up to ignore these files (the reads and the reference), you should update it so that they are ignored. You should NOT be uploading any of these files.
There are three exercises in today’s assignment:
The first two exercises will occur entirely in the command line. As such, you will be writing and submitting an alignment-variant_calling.sh
bash script that will run all of the code for exercises 1 and 2.
NOTE: Some of the steps can take a while. When you’ve finished running each step, you can comment out that section of your bash script so that it doesn’t run every time you run your script. However, please uncomment these lines before submitting your script.
Create an empty alignment-variant_calling.sh
script now.
During the first exercise, you’ll be aligning the reads from each of the 10 yeast strains to the sacCer3 reference.
You’ll be using a tool called bwa
(BWA manual) to perform alignment. Before you can align your sequencing reads, bwa
needs you to index the sacCer3 genome. Without getting into the the nitty gritty, this essentially means creating a table of contents (and index) that bwa
can use to quickly find matches between your reads and the reference genome.
Using bwa index
, create an index for the sacCer3.fa
reference.
Now that you’ve indexed the reference, you can align your reads to the reference using bwa mem
. Because you’ll want to run this step the same way on all 10 strains, it makes sense to do this step in a bash for
loop. Consider this resource for how to write a for loop in a bash script: bash for loops walkthrough.
Create a bash for
loop that loops through each of the 10 samples. For each sample, use bwa mem
to align the reads to the reference.
IT IS VERY IMPORTANT that you assign each sample a read group during this process, so that individual samples can be distinguished later in Step 2.1. You can do this with the (somewhat cryptic) -R
flag, which you use to add a line to the header of each output alignment file. An example of a header line you can add with the -R
flag is "@RG\tID:Sample1\tSM:Sample1"
. You can replace “Sample1” here with the appropriate sample name for each of your yeast strains.
Perhaps consider the -t
and -o
flags as well.
Now that you’ve aligned your reads to the reference, you should have 10 .sam
files, one for each sample. These files contain all of the alignments for each yeast strain. You can see how they’re organized with less -S
, and you can read more about the SAM format here.
These files contain all of the information you need for variant calling, but before you can do that, they’ll need to be sorted and indexed (similar to how you indexed the reference in Step 1.1. For both of these tasks you can use the samtools
program (manual here, or you can just run samtools help
).
First, sort each of your .sam
files using samtools sort
. You can do this in a new for
loop in your bash script or, even better, in the same for
loop you used for alignment. You’ll want to output these sorted files as .bam
files, which contain the same information as the .sam
file but are compressed.
Perhaps consider the -O
and -o
flags when running samtools sort
.
Next, create an index for each of the resulting sorted .bam
files using samtools index
. As before, you can do this in a new for
loop in your bash script or in the same for
loop as the previous two steps.
At the end of this step, you should have 10 sorted .bam
files and their corresponding .bam.bai
indices.
Now that you’ve aligned the sequencing reads to the reference, you can call genetic variants across the yeast strains.
For variant calling, you’ll be using a tool called freebayes
(manual here, or you can just run freebayes --help
).
Use freebayes
to identify genetic variants in all of your yeast strains concurrently (i.e. you should only be running freebayes
once will all samples, not for each sample separately). It will output results in Variant Call Format (.vcf
). You can read more about the VCF format here.
You should consider using the -f
, --genotype-qualities
, and -p
flags. You might like the -L
flag as well.
NOTE: For TAs, running this step took nearly 15 minutes. We expect this step will take a similar amount of time for you, and your computer might make a lot of noise.
Sometimes, freebayes
will call variants that may not be “real”. Luckily, variant callers generally report, for each variant, a “site quality”, that describes the probability of that site truly being polymorphic. You can use that quality score to filter out low quality (low confidence) variants.
Note that the “site quality” is not the same as the “genotype quality”, which describes the probability that a single sample’s genotype at some variant is correct. Both qualities, however, are generally reported on the “Phred” scale (more here).
You can filter out low quality variants using the vcffilter
tool (documentation here).
To use vcffilter
if you’re getting Illegal Instruction: 4 errors, get out of your conda environment, then run:
brew install brewsci/bio/vcflib
Filter your VCF using vcffilter
so that you only keep variants whose estimated probability of being polymorphic is greater than 0.99. You should consider how to do this with the -f
flag. Output your filtered variant calls to a new VCF file.
Sometimes, especially in regions with complex alignments between samples, you can run into cases where there are multiple possible alternative alleles at the same position. There’s nothing inherently “wrong” with this, but it does often complicate downstream analyses.
Luckily, you can use the vcfallelicprimitives
tool (documentation here) to decompose these more complex haplotypes into more manageable biallelic variants.
Use vcfallelicprimitives
to decompose complex haplotypes in your filtered VCF from Step 2.2. We suggest using the -k
and -g
flags to keep annotations for the variant sites and sample genotypes in your VCF. Output to a new VCF file.
Now that you’ve got these high-quality and nicely behaving variant calls, you want to know what impact these variants might have. Obviously, this is a huge question, and is the basis of all of genetics, but we can get a basic idea of their functional impact on nearby genes (e.g. are they missense, nonsense, etc.) using the snpEff
tool (the online documentation isn’t great, but running snpEff ann --help
should provide some useful information).
If it wasn’t obvious, snpEff
requires prior annotations (e.g. gene annotations) to work. Have snpeff
download its database of Saccharomyces cerevisiae annotations using the following command (we told you the NCBI ID would be relevant):
snpEff download R64-1-1.105
Finally, use snpEff ann
to annotate your VCF with the predicted functional effects that these genetic variants may have. Output to a new (and final) VCF.
For submission purposes, use head
to grab just the first 100 lines of your final VCF and store this in a new VCF. You will submit this “sample” VCF along with the rest of your assignment. YOU SHOULD NOT SUBMIT ANY OTHER VCFS; THEY ARE TOO BIG. Depending on how your .gitignore
is set up, you may need to do git add --force <yoursamplevcf.vcf>
.
Now that you’ve discovered variants in these strains and annotated their predicted functional effects, you’d like to do some basic exploratory analysis of the patterns you observe in the VCF. You’ll be using Python to create a single nicely formatted and labeled multi-panel plot (e.g., use subplots
with multiple rows and columns) that explores the following features of the data:
Each feature will be a single panel in your multi-panel plot.
Create an empty variation_analysis.py
script now where you’ll be doing the analyses in the next steps.
For these analyses, you’ll have to parse the filtered/biallelic VCF you generated in Step 2.4. If you can find a python library that handles VCF parsing, you’re welcome to use it, but it may be easier to simply use the following structure:
for line in open(<vcf_file_name>):
if line.startswith('#'):
continue
fields = line.rstrip('\n').split('\t')
# grab what you need from `fields`
Each of the following analyses needs information from a different field/column from the VCF. As such, you’ll be grabbing multiple pieces of information from each line, so if you do use this structure, instead of looping through the VCF separately for each analysis, try to use just a single loop across all analyses.
Remember, you can read more about the VCF file format here.
Plot a histogram showing the distribution of read depth at each variant across all samples (e.g. if you had 10 variants and 5 samples, you’d have 50 data points).
This information can be found in the sample specific FORMAT fields and the end of each line. Check the file header to decide which ID is appropriate.
Make sure you label the panel appropriately.
Plot a histogram showing the distribution of genotyping quality at each variant across all samples (as before, if you had 10 variants and 5 samples, you’d have 50 data points).
This information can be found in the sample specific FORMAT fields and the end of each line. Check the file header to decide which ID is appropriate. Remember, “genotype quality” is NOT the same as “site quality” and is a different part of the VCF line.
Make sure you label the panel appropriately.
Plot a histogram showing the allele frequency spectrum (distribution) of the variants in the VCF (this is a per-variant metric, so with 10 variants and 5 samples, you’d only have 10 data points).
This information is pre-calculated for you and can be found in the variant specific INFO field. Check the file header to decide which ID is appropriate.
Make sure you label the panel appropriately.
Create a barplot showing the predicted effect(s) of the variants in the VCF (i.e. a bar for each “type” of effect showing the number of variants with that effect).
In Python, produce a nicely formatted and labeled multi-panel plot (e.g., use subplots
with multiple rows and columns) describing your variants.
Explore each of the following characteristics of the variant genotypes called across all ten yeast samples. (Each characteristic will be a subplot in the multi-panel plot).
This information was added to the VCF by snpEff
and can be found in the variant specific INFO field. Check the file header to decide which ID is appropriate and how to parse the information.
NOTE: We encourage you to consider every possible effect for each variant, but feel free to just grab the first one.
Make sure you label the panel appropriately.
Total Points: 10