Assignment Date: Friday, Oct. 6, 2023
Due Date: Friday, Oct. 13, 2023 @ 10:00pm ET
The aim of this assignment is to have you create a basic peak-finding script. This will build on the interactive coding during lecture.
As usual, create a folder for this assignment in your qbb2023-answers repo and create a README.md file in that folder for written answers.
You will be using ChIP-seq data generated by the ENCODE project for the architectural protein CTCF in the Drosophila melanogaster cell line S2. There are 6 datasets, 3 pairs for two replicate samples and a control sample. You will also use the position frequency matrix (PFM) of dCTCF from the JASPAR transcription factor binding profile database.
To get the data, download it with the following command:
wget https://www.dropbox.com/scl/fi/4e6ijpve63dpjmoslz43u/chipseq_data.tar.gz?rlkey=neyal861rpxnkolnk2lxe5i7k -O chipseq_data.tar.gz
Next, unpack the datasets.
tar xzf chipseq_data.tar.gz
You should now have 10 files:
ChIP-seq data were mapped against the dm6 genome and were preprocessed to pull out the end location of each read, restricted to chromosome chr2R, partitioning them by strand, and creating pile-ups in bedgraph format. There is also a bed file containing all peak calls from chr2R, as your analysis will be restricted to a smaller region for processing time considerations.
There are also two python scripts to serve as a framework for how to approach the assignment. They outline the general steps in comments and include a couple of functions you will need to accomplish each task.
Using the estimated fragment size determined during the live coding portion of class and the bedgraph files for each sample and the control, determine regions with p-values <= 1e-10 given the local background under the poisson distribution. You will be using the find_peaks.py
script as your framework for this portion. Your script should process one sample at a time and you will run it for each sample. Restrict your analysis to the region chr2R:10,000,000-12,000,000. Your script should produce the following outputs:
Because you have replicates, you can boost the confidence of your peak calls to those appearing in both samples. To do this, you will be using bedtools
, a program that performs genomic arithmetic. To get a list of peaks appearing in both samples, use the following command:
bedtools intersect -a sample1_peaks.bed -b sample2_peaks.bed > combined_peaks.bed
Obviously you will need to change the file names if you use a different naming scheme for your peak calling output.
Question: What fraction of peaks were retained when intersected compared to sample1? Sample2?
IGV is short for Integrative Genomics Viewer. It is like a light-weight genome browser you can run on your own computer. To open it, type igv
in the terminal and hit enter. A browser should open.
Drosophila melanogaster (dm6)
from the drop-down menu in the upper left.Question: How reproducible are the peaks called between the two samples? Is the p-value range of a peak indicative of reproducibility? Is it completely consistent?
For this assignment you should submit three things:
Your IGV browser image (1 point)
Using ‘score_peak_seqs.py’ as a framework, write a script that loads in the sequence for chromosome chr2R and the combined_peak bed file as well as the CTCF position frequency matrix. For each peak, score each k-mer (where k is the length of the sequence) and identify the best score and k-mer for each sequence. Your script should output a graph with the following subplots:
Rerun the script using the all_combined_peaks.bed file to produce a second plot.
Questions:
Rerun your find_peaks script but using the whole chromosome chr2R. This will give you a wiggle file with p-values covering the whole chromosome. Using the all_combined_peaks bed file and the p-value wiggle track you just generated, extract peak sequences with a max p-value of at least 1e-250. Save these sequences in a fasta file.
Create a mamba environment with the program meme
:
mamba create -n meme meme
mamba activate meme
Run meme
with a minimum motif width of 8, a maximum width of 12, zoops mode, and search for 4 motifs. Make sure you also specify that it is DNA and allow for reverse complement motifs. You can speed this up by using the -p parameter to run it with more cores. Look at the resulting meme.html
file. Were you able to find the CTCF motif? Submit your motif logo and add the e-value of your motif to your README.md file.
Don’t forget to deactivate your mamba environment with mamba deactivate
.