In today’s assignment, you will be simulating sequencing coverage data across a genome. The goal is to understand how probability distributions can inform the amount of coverage needed to reconstruct an entire genome. You will also be introduced to genome assembly, by constructing a de Bruijn graph assembly. This assembly will be visualized as a directed graph using graphviz
.
In your README.md
for this assignment, answer the following question (show your work):
Using Python, simulate sequencing 3x coverage of a 1Mbp genome with 100bp reads. Note that you do not need to actually simulate the sequences of the reads, you can just randomly sample positions in the genome and record the coverage. You do not need to consider the strand of each read.
The start position of each read should have a uniform random probabilty at each possible starting position (0 through 999,900). You can record the coverage in an array of 1M positions.
Once you have all of your reads simulated and recorded, save your coverages (there should be 1 million) into a text file which you can open in R for plotting.
num_reads = calculate_number_of_reads(genomesize, readlength, coverage)
## use an array to keep track of the coverage at each position in the genome
genome_coverage = initialize_array_with_zero(genomesize)
for i in range(len(num_reads)):
startpos = uniform_random(0,genomelength-readlength+1)
endpos = startpos + readlength
genomecoverage[startpos:endpos] += 1
## get the range of coverages observed
maxcoverage = max(genomecoverage)
xs = list(range(0, maxcoverage+1))
## Get the poisson pmf at each of these
poisson_estimates = get_poisson_estimates(xs, lambda = coverage)
## Get normal pdf at each of these (i.e. the density between each adjacent pair of points)
normal_estimates = get_normal_estimates(xs, mean = genome_coverage, stddev = sqrt(genome_coverage))
## now plot the histogram and probability distributions
...
Now using R, plot the histogram of coverage across the genome. Overlay the histogram with a Poisson distribution with lambda = 3. Also overlay the distribution with a Normal distribution with a mean of 3 and a std. dev. of 1.73 (which is the square root of 3).
scipy.stats.poisson.pmf()
function (more here) or the R dpois
function (more here). Unlike the Poisson distribution, the normal distribution is continuous, and so we will instead want to use the probability density function (PDF). Note that for both of these, this will give you the probability of observing each coverage. What do we need to do to transform these probabilities into a frequency count comparable to those in our histogram?Upload this plot as ex1_3x_cov.png
in your submission directory. All plots should be clearly labelled and easily interpretable (i.e. axis labels, legend describing the three things plotted, etc.).
Using your results from Step 1.3, answer the following questions in your README.md
:
Now, repeat the analysis with 10x coverage:
ex1_10x_cov.png
in your submission directory.README.md
, answer the following questions:
Now, repeat the analysis with 30x coverage:
ex1_30x_cov.png
in your submission directory.README.md
, answer the following questions:
Next, you’re going to generate your own de Bruijn graph using a provided set of reads. Copy the list of reads below into your code:
reads = ['ATTCA', 'ATTGA', 'CATTG', 'CTTAT', 'GATTG', 'TATTT', 'TCATT', 'TCTTA', 'TGATT', 'TTATT', 'TTCAT', 'TTCTT', 'TTGAT']
Write code to find all of the edges in the de Bruijn graph corresponding to the provided reads using k = 3 (assume all reads are from the forward strand, no sequencing errors, complete coverage of the genome). Each edge should be of the format ATT -> TTC
. Write all edges to a file, with each edge as its own line in the file.
graph = set()
for each read:
for i in range(len(read) - k):
kmer1 = read[i: i+k]
kmer2 = read[i+1: i+1+k]
add "kmer1 -> kmer2" to graph
for each edge in graph:
print edge
Now that we know all of the edges, we can go about actually visualizing the de Bruijn graph. For this task, we’ll be using the graphviz command line tool.
Create a conda
environment for this assignment and install graphviz
:
conda create -n graphviz -c conda-forge graphviz
conda activate graphviz
Graphviz’s command line tool is called dot
. Read more about how to use dot
and the file format it’s expecting here.
Based on what you’ve read, modify your code from Step 2.1 to output the edges in a format dot
can use.
Now, use dot
to produce a directed graph. Record the command you used in your READMD.md
. Upload this graph as ex2_digraph.png
in your submission directory. You do NOT need to upload the text file of edges you used to make the graph.
Assume that the maximum number of occurrences of any 3-mer in the actual genome is five. Using your graph from Step 2.4, write one possible genome sequence that would produce these reads. Record your answer in your README.md
.
In a few sentences, what would it take to accurately reconstruct the sequence of the genome? Record your answer in your README.md
.
Download the human chomosome 22 DNA sequence using the following command:
wget https://schatz-lab.org/appliedgenomics2023/assignments/assignment1/chr22.fa.gz
A kmer is a substring of length k. For example, the string GATTACA, has these 3-mers: GAT, ATT, TTA, TAC, ACA.
A string of length G has G - k + 1 kmers. For long strings, G - k + 1 is nearly the same as G e.g. for human using 19mers, 3,000,000,000 vs 2,999,999,986.
While a string of length G has G-k+1 kmers, there may be many fewer distinct kmers. For example, in the string “GCATCATCATCATCATCATCAT…” the kmers are: GCA, CAT, ATC, TCA, CAT, ATC, TCA, CAT, ATC, TCA, CAT, …. As such, there are only 4 disinct kmers (GCA, CAT, ATC, TCA). Of these, GCA occurs once and the others occur many times.
How many As, Cs, Gs, Ts and Ns are found in the entire chromosome? If needed, convert lowercase letters to uppercase. Any other character can be converted to N. Record your answer in your README.md
.
Using Python, tally the frequency of all of the different 19-mers in the chromosome, and calculate the kmer frequency spectrum up to 1000 (e.g. how many kmers occur 1 time, how many occur 2 times, how many occur 3 times, etc.). Store this in a list. For this, convert lowercase letters to uppercase, and convert any character that is not A, C, G or T to A (especially N characters). We recommend you use a dictionary to tally the frequencies using this pseudocode.
In your README.md
, show the kmer frequency spectrum for 1 to 20, e.g. how many kmers occur 1 time, how many occur 2, …, how many occur 20 times.
## initialize kmer length
k=19
## read genome, convert to upper case and convert non-DNA to 'A'
genome_string = read_from_file("chr22.fa")
## dictionary that maps a kmer (like GAT) to a frequency (like 3)
kmer_frequency = {}
## now scan the genome, extract kmers, and tally up their frequencies
for i in range(len(genome_string)-k+1):
kmer = genome_string[i, i+k]
kmer_frequency[kmer] = kmer_frequency[kmer] + 1
## now tally the frequencies in a dictionary that maps kmer frequency to count
## also determine the maximum kmer frequency
tally = {}
all_kmers = kmer_frequency.keys()
max_frequency = 0
for each kmer in all_kmers:
freq = kmer_frequency[kmer]
tally[freq] = tally[freq] + 1
if freq > max_frequency:
max_frequency = freq
freq_spectrum = []
## now print in sorted order
for i in range(1, max_frequency+1):
if i in tally:
freq = tally[i]
append freq to freq_spectrum
else:
append 0 to freq_spectrum
Using your list from Step 3.2, plot the kmer frequency spectrum: x-axis is the kmer frequency, and the y-axis is the number of kmers that occur at each of those frequencies times. Make sure to plot both the x and y-axis in log space. Upload this graph as ex4_kmer_spec.png
in your submission directory.
Using your list from Step 3.2, answer the following questions. Record your answers in your README.md
.
ex1_*_cov.png
(1 point)README.md
file with answers to questions in the assignment (3 points total)
ex1_3x_cov.png
clearly labelled (0.5 point)ex1_10x_cov.png
clearly labelled (0.5 point)ex1_30x_cov.png
clearly labelled (0.5 point)ex2_digraph.png
(just the output of dot
) (0.5 point)Total Points: 10