ChIP-seq & Motif Finding

Assignment Date: Friday, Oct. 6, 2023
Due Date: Friday, Oct. 13, 2023 @ 10:00pm ET

Lecture

Assignment Overview

The aim of this assignment is to have you create a basic peak-finding script. This will build on the interactive coding during lecture.

Setup

As usual, create a folder for this assignment in your qbb2023-answers repo and create a README.md file in that folder for written answers.

Data

You will be using ChIP-seq data generated by the ENCODE project for the architectural protein CTCF in the Drosophila melanogaster cell line S2. There are 6 datasets, 3 pairs for two replicate samples and a control sample. You will also use the position frequency matrix (PFM) of dCTCF from the JASPAR transcription factor binding profile database.

To get the data, download it with the following command:

wget https://www.dropbox.com/scl/fi/4e6ijpve63dpjmoslz43u/chipseq_data.tar.gz?rlkey=neyal861rpxnkolnk2lxe5i7k -O chipseq_data.tar.gz

Next, unpack the datasets.

tar xzf chipseq_data.tar.gz

You should now have 10 files:

sample1.fwd.bg
sample1.rev.bg
sample2.fwd.bg
sample2.rev.bg
control.fwd.bg
control.rev.bg
CTCF.pfm
all_combined_peaks.bed
find_peaks.py
score_peak_seqs.py

ChIP-seq data were mapped against the dm6 genome and were preprocessed to pull out the end location of each read, restricted to chromosome chr2R, partitioning them by strand, and creating pile-ups in bedgraph format. There is also a bed file containing all peak calls from chr2R, as your analysis will be restricted to a smaller region for processing time considerations.

There are also two python scripts to serve as a framework for how to approach the assignment. They outline the general steps in comments and include a couple of functions you will need to accomplish each task.

Step 1: Calling peaks

Using the estimated fragment size determined during the live coding portion of class and the bedgraph files for each sample and the control, determine regions with p-values <= 1e-10 given the local background under the poisson distribution. You will be using the find_peaks.py script as your framework for this portion. Your script should process one sample at a time and you will run it for each sample. Restrict your analysis to the region chr2R:10,000,000-12,000,000. Your script should produce the following outputs:

A wiggle file with base-pair resolution -log10-transformed p-values
A bed file with all regions with p-values <= 1e-10

Step 2: Intersecting peaks

Because you have replicates, you can boost the confidence of your peak calls to those appearing in both samples. To do this, you will be using bedtools, a program that performs genomic arithmetic. To get a list of peaks appearing in both samples, use the following command:

bedtools intersect -a sample1_peaks.bed -b sample2_peaks.bed > combined_peaks.bed

Obviously you will need to change the file names if you use a different naming scheme for your peak calling output.

Question: What fraction of peaks were retained when intersected compared to sample1? Sample2?

Step 3: Visualizing the data with IGV

IGV is short for Integrative Genomics Viewer. It is like a light-weight genome browser you can run on your own computer. To open it, type igv in the terminal and hit enter. A browser should open.

Select the correct genome, Drosophila melanogaster (dm6) from the drop-down menu in the upper left.
You will add the wiggle and bed tracks you generated in the previous step for each sample. To do this, select File->Load from file in the top menu bar. Select your datasets (you can select multiple datasets at the same time). These should now load in the browser.
Load the combined_peak bed file you created using the same procedure
Navigate to the correct genomic region. You can select the chromosome from a drop-down menu next to the genome menu at the top. To zoom in, you can use the plus sign button in the upper right or click and drag to highlight a region in the coordinate bar. You can quickly move by dragging the red rectangle over the ideogram. You can also directly type in the region to go to in the text box next to the chromosome selector.
Format your wiggle tracks. Two-finger clicking on the track name on the left brings up a menu of options. I suggest adjusting the “track height” (100 seems reasonable) and the “set data range” (make the high value ~20 to see the lower signal).
Find a nice view and save an image by going to File->Save image… from the menu bar at the top.

Question: How reproducible are the peaks called between the two samples? Is the p-value range of a peak indicative of reproducibility? Is it completely consistent?

Submission

For this assignment you should submit three things:

Your completed find_peaks.py script (7 points total)
- Load in bedgraph data (1 point)
- Combine tag densities (1 point)
- Adjust control coverage (1 point)
- Calculate background (1 point)
- Score sample (1 point)
- Calculate and log transform p-values (1 point)
- Output data (1 point)
Your IGV browser image (1 point)
A README.md file with answers to the questions in steps 2 and 3 (2 points total)
- Answer to step 2 (1 point)
- Answer to step 3 (1 point)

Advanced exercises

Scoring sequences using a position weight matrix

Using ‘score_peak_seqs.py’ as a framework, write a script that loads in the sequence for chromosome chr2R and the combined_peak bed file as well as the CTCF position frequency matrix. For each peak, score each k-mer (where k is the length of the sequence) and identify the best score and k-mer for each sequence. Your script should output a graph with the following subplots:

A histogram of the best score for each peak
An image of the CTCF position frequency matrix (using imshow)
An image of the position frequency matrix produced by combining the best k-mer from each peak

Rerun the script using the all_combined_peaks.bed file to produce a second plot.

Questions:

Do the score distributions look similar between the small and large samples?
Can you infer anything about the preferred binding strength of under natural selection?
What percentage of peaks had a binding site with a score of at least 1?

Finding motifs

Rerun your find_peaks script but using the whole chromosome chr2R. This will give you a wiggle file with p-values covering the whole chromosome. Using the all_combined_peaks bed file and the p-value wiggle track you just generated, extract peak sequences with a max p-value of at least 1e-250. Save these sequences in a fasta file.

Create a mamba environment with the program meme:

mamba create -n meme meme
mamba activate meme

Run meme with a minimum motif width of 8, a maximum width of 12, zoops mode, and search for 4 motifs. Make sure you also specify that it is DNA and allow for reverse complement motifs. You can speed this up by using the -p parameter to run it with more cores. Look at the resulting meme.html file. Were you able to find the CTCF motif? Submit your motif logo and add the e-value of your motif to your README.md file.

Don’t forget to deactivate your mamba environment with mamba deactivate.