Assignment Date: Friday, Sept. 10, 2021
Due Date: Thursday, Sept. 16, 2021 @ 11:59pm
Slides are available here: GenomeAssembly.pdf
In this assignment, you are given a set of unassembled reads from a mysterious pathogen that contains a secret message encoded someplace in the genome. The secret message will be recognizable as a novel insertion of sequence not found in the reference. Your task is to assess the quality of the reads, assemble the genome, and then identify and decode the secret message. If all goes well, the secret message should decode into recognizable English text; otherwise, double-check your coordinates and try again.
Some of the tools you will need to use only run in a Linux or Mac environment. Several have been pre-installed on your laptop. However, you will need to install 3 new packages. Please run the following commands:
conda install -c bioconda jellyfish
conda install spades mummer
For tips on how to run the tools that you need for this assignment, look in the Resources section below.
Finally, keep track of all commands that you use in a text or markdown file. You will need to submit this along with any graphs or images.
Download the reads and reference genome from: https://github.com/bxlab/qbb2021/raw/main/week1/asm.tgz
Note: I have provided both paired-end and mate-pair reads (see the included README for details). Make sure to use all of the reads for the coverage analysis and k-mer analysis, as well as in the assembly.
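For the coverage analysis, average depth is usually estimated as the total number of sequenced bases divided by the genome length. The following is a minimal sketch of that arithmetic, assuming uncompressed FASTQ files with exactly four lines per record; the function names are illustrative and not part of the assignment materials.

```python
def count_fastq(path):
    """Return (number_of_reads, total_bases) for an uncompressed FASTQ file.

    Assumes the simple 4-line record layout (header, sequence, '+', qualities).
    """
    n_reads = 0
    n_bases = 0
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the 2nd line of each record
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases


def expected_coverage(total_bases, genome_length):
    """Average depth: total sequenced bases divided by genome length."""
    return total_bases / genome_length
```

To cover all of the reads, sum the base counts over every FASTQ file in the download and divide by the reference length reported by samtools faidx.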
samtools faidx
FastQC
FastQC
Use Jellyfish to count the 21-mers in the read data. Make sure to use the "-C" flag to count canonical k-mers; otherwise your analysis will not correctly account for the fact that your reads come from either strand of DNA.
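To see why canonical counting matters, here is a small pure-Python sketch. It assumes the common definition of a canonical k-mer as the lexicographically smaller of a k-mer and its reverse complement; it only illustrates the idea and does not reproduce how Jellyfish is implemented.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k):
    """Count canonical k-mers across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            counts[canonical(kmer)] += 1
    return counts
```

With this definition, the read AAACC and its reverse complement GGTTT contribute to the same three 3-mers (AAA, AAC, ACC), so reads from opposite strands of the same locus are counted together rather than split across two sets of k-mers.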
jellyfish histo
jellyfish dump along with sort and head
Assemble the reads using Spades. Spades will not run on Windows; you must use a Linux or Mac environment.
grep -c '>' contigs.fasta
samtools faidx, plus a short script if necessary
samtools faidx plus sort -n
Use MUMmer for whole genome alignment.
dnadiff
nucmer and show-coords
dnadiff
show-coords
show-coords
samtools faidx to extract the insertion
dna-decode.py to decode the string from 5c.
The solutions to the above questions should be submitted as a markdown or text file on GitHub, in your qbb2020-answers repo. Include any requested figures within the markdown file, or push them separately to GitHub. Make sure to clearly label each of the subproblems and give the exact commands you used for solving the question.
$ fastqc /path/to/reads.fq
If you have problems, make sure java is installed (sudo apt-get install default-jre)
Note: Jellyfish requires a 64-bit operating system. Download the package and compile it like this:
$ jellyfish count -m 21 -C -s 1000000 /path/to/reads*.fq
$ jellyfish histo mer_counts.jf > reads.histo
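The histogram written by jellyfish histo is a plain text file with two whitespace-separated columns: the k-mer multiplicity and the number of distinct k-mers with that multiplicity. As a rough sketch (the function names and the simple valley-then-peak heuristic are assumptions for illustration, not part of the assignment), the main coverage peak could be located like this:

```python
def read_histo(path):
    """Parse `jellyfish histo` output into (multiplicity, n_kmers) pairs."""
    histo = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # ignore blank or malformed lines
            histo.append((int(fields[0]), int(fields[1])))
    return histo


def coverage_peak(histo):
    """Return the multiplicity of the main peak after the error valley.

    Low-multiplicity k-mers are dominated by sequencing errors, so walk
    down that initial slope to the first local minimum before looking
    for the tallest bin.
    """
    valley = 0
    while valley + 1 < len(histo) and histo[valley + 1][1] < histo[valley][1]:
        valley += 1
    return max(histo[valley:], key=lambda pair: pair[1])[0]
```

On noisy data the first local minimum may not be the true valley, so it is worth checking the result against a plot of the histogram.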
GenomeScope is a web-based tool so there is nothing to install. Hooray! Just make sure to use the -C flag when running jellyfish count so that the reads are correctly processed.
Normally spades would try several values of k and merge the results together, but here we will force it to just use k=31 to save time. The assembly should take a few minutes.
$ spades.py --pe1-1 frag180.1.fq --pe1-2 frag180.2.fq --mp1-1 jump2k.1.fq --mp1-2 jump2k.2.fq -o asm -t 4 -k 31
$ dnadiff /path/to/ref.fa /path/to/qry.fa
$ nucmer /path/to/ref.fa /path/to/qry.fa
$ show-coords out.delta
WARNING: nucmer and related tools may fail if the path to the binaries contains spaces or special characters (such as '@').
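One way to look for a novel insertion in the alignment coordinates is to compare neighboring alignments: if two alignments are adjacent in the reference but far apart in the assembled contig, the extra contig sequence between them is a candidate. The sketch below assumes the coordinates were saved in tab-delimited form (for example with show-coords -T out.delta > coords.tsv), where the first four columns are reference start/end and query start/end and the last two are the reference and query IDs. The function names, dictionary keys, and the min_extra threshold are hypothetical.

```python
def read_coords(path):
    """Parse tab-delimited show-coords output into a list of alignment dicts."""
    alignments = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip header or blank lines: data rows start with four integers.
            if len(fields) < 9 or not all(f.isdigit() for f in fields[:4]):
                continue
            s1, e1, s2, e2 = map(int, fields[:4])
            alignments.append({
                "ref_start": s1, "ref_end": e1,
                "qry_start": s2, "qry_end": e2,
                "ref_id": fields[-2], "qry_id": fields[-1],
            })
    return alignments


def insertion_candidates(alignments, min_extra=100):
    """Return (qry_id, start, end) gaps that are longer in the query than in the reference."""
    by_pair = {}
    for aln in alignments:
        by_pair.setdefault((aln["ref_id"], aln["qry_id"]), []).append(aln)
    candidates = []
    for (_ref_id, qry_id), alns in by_pair.items():
        alns.sort(key=lambda a: a["ref_start"])
        for left, right in zip(alns, alns[1:]):
            ref_gap = right["ref_start"] - left["ref_end"] - 1
            # Query coordinates run backwards for reverse-strand alignments,
            # so normalize each interval to (low, high) first.
            l_lo, l_hi = sorted((left["qry_start"], left["qry_end"]))
            r_lo, r_hi = sorted((right["qry_start"], right["qry_end"]))
            qry_gap = max(r_lo - l_hi, l_lo - r_hi) - 1
            if qry_gap - ref_gap >= min_extra:
                candidates.append((qry_id, min(l_hi, r_hi) + 1, max(l_lo, r_lo) - 1))
    return candidates
```

Each candidate is a contig ID with start and end coordinates, which is the same information a samtools faidx region string (contig_id:start-end) needs.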
$ ./samtools faidx /path/to/genome.fa contig_id:1234-5678
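Running samtools faidx on a FASTA file without a region writes an index next to it (genome.fa.fai), a tab-separated file whose first two columns are the sequence name and its length. A short script can summarize those lengths. The sketch below is one possible version, with illustrative function names, reporting the N50 alongside what sort -n would show.

```python
def read_fai_lengths(path):
    """Return the sequence lengths listed in a samtools .fai index."""
    lengths = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths.append(int(fields[1]))  # column 2 is the length
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no sequences at all
```

For example, after indexing contigs.fasta, `lengths = read_fai_lengths("contigs.fasta.fai")` gives a list from which `len(lengths)`, `sum(lengths)`, `max(lengths)`, and `n50(lengths)` can be read off directly.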
9-17-2020