Assignment Date: Friday, Sept. 17, 2021
Due Date: Friday, Sept. 24, 2021 @ 1:00pm ET
Slides are available here: 0210914_qblab_variant_calling.pptx
Today we will perform de novo identification of variants in multiple haploid yeast strains. These strains are the progeny of a cross between a lab strain and a wine strain. The data come from Finding the sources of missing heritability in a yeast cross.
If you do not write a bash script, keep track of all your command-line code in a txt
or md
file.
Push all scripts (Python and/or command line), your txt
or md
file (if applicable), a nicely formatted multi-panel plot, and ONLY the first 1000 lines of your filtered, decomposed, and annotated VCF to your qbb2021-answers
Github repository. Do not push any other raw data to Github, and do not push the full VCF!
The following zip file contains ten sets of single-end Illumina sequencing reads, each for a different yeast strain.
wget "https://github.com/bxlab/qbb2021/raw/main/week2/BYxRM_subset.tar.xv"
tar -xvzf BYxRM_subset.tar.xv
You will be aligning reads from your yeast samples to the Saccharomyces cerevisiae reference genome. This reference is called sacCer3 by the UCSC genome browser, but its name in the NCBI Assembly archive is R64-1-1. You need this info later for snpEff.
When you unzip the reference genome file from UCSC, it will have separate fasta files for each chromosome. You should combine these files into a single whole-genome reference by using the code below:
wget "http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz"
tar -xvzf chromFa.tar.gz
cat chr*.fa > sacCer3.fa
rm chr*.fa
Your task now is to identify genetic variants in these samples and annotate them with their predicted functional effects.
Since several parts of this assignment will involve running the same code on each of your 10 yeast samples, you may want to write a bash script that will automate this process for you. However, this is not required.
bwa index
Before you can align your sequencing reads, you first need to index the sacCer3 genome.
bwa mem
Align your reads against the sacCer3 reference genome.
IT IS VERY IMPORTANT that you assign each sample a read group during this process, so that individual samples can be distinguished in Step 4. You can do this with the (somewhat cryptic) -R
flag, which you use to add a line to the header of each output alignment file. An example of a header line you can add with the -R
flag is "@RG\tID:Sample1\tSM:Sample1"
.
Perhaps consider the -t
and -o
flags as well.
samtools
, for input to variant callersPerhaps consider the -O
and -o
flags.
freebayes
Use freebayes
to identify genetic variants in all of your yeast strains concurrently. It will output results in Variant Call Format (VCF). You should consider using the -f
, --genotype-qualities
, and -p
flags.
Note: This step will take a few minutes, and your computer might make a lot of noise.
vcffilter
Filter your VCF so that you only keep variants whose estimated probability of being polymorphic is greater than 0.99. You should consider how to do this with the -f
flag. The freebayes
documentation will be helpful here, as well as this vcffilter info.
vcfallelicprimitives
We suggest using the -k
and -g
flags to keep annotations for the variant sites and sample genotypes in your VCF.
You can reference vcfallelicprimitives documentation 1 and vcfallelicprimitives documentation 2.
snpeff ann
First, fetch the appropriate yeast reference database:
snpeff download R64-1-1.99
Then, use snpeff ann
to annotate your VCF with the predicted functional effects that these genetic variants may have.
We recommend not Googling the snpeff
documenation. It will tell you to use java -jar snpEff.jar
, which you should not. The help option for snpeff ann
’s command-line tool is 100 times better.
In Python, produce a nicely formatted and labeled multi-panel plot describing your variants.
Explore each of the following characteristics of the variant genotypes called across all ten yeast samples. (Each characteristic will be a subplot in the multi-panel plot).
You may find it helpful to reference this page of the snpeff
manual, which describes the format of its output VCF.
Push all scripts, a record of your command line commands (if applicable), your multi-panel plot, and ONLY the first 1000 lines of your filtered, decomposed, and annotated VCF to your qbb2021-answers
repo. Do not push any other raw data to Github, and do not push the full VCF!
“Got anything else?” you ask. Of course we do. If you have time, try your hand at the advanced exercises:
Push all scripts, a txt
or md
file with your answers, and plots to your qbb2021-answers
Github repository.
Q1a. How many 100bp reads are needed to sequence a 1Mbp genome to 5x coverage?
Q1c. Using the histogram from 1b, how much of the genome has not been sequenced (has 0x coverage)? How well does this match Poisson expectations?
ATTCA
ATTGA
CATTG
CTTAT
GATTG
TATTT
TCATT
TCTTA
TGATT
TTATT
TTCAT
TTCTT
TTGAT
Q2b. Using the k-mers from Q2a, ssume that the maximum number of occurrences of any 3-mer in the actual genome is 4. Write out one possible genome sequence.
Q2c. What would it take to fully resolve the genome?