Assignment 5: ChIP-seq and Motif Finding

Assignment Date: Friday, Oct. 8, 2021
Due Date: Friday, Oct. 15, 2021 @ 1pm ET

Lecture

Live-coding jupyter notebook: wget https://github.com/bxlab/qbb2021/raw/main/week5/ChIPseq%20and%20Motif%20finding%20-%20Live.ipynb

Live-coding markdown: wget https://github.com/bxlab/qbb2021/raw/main/week5/ChIPseq_live_coding.md

Homework

Part 1: ChIP-seq

Data

For this lab, we are going to use some datasets from the 2014 study, “A comparative encyclopedia of DNA elements in the mouse genome” and the 2011 study, “Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration”.

In both of these works, the authors use a cell line called G1E-ER4 derived from Gata1-null mouse ES cells and expressing an estrogen-activated Gata1-estrogen receptor (ER) transgene. Thus, these cells undergo erythroid differentiation when treated with estradiol.

A number of transcription factors and histone modifications have been assayed for both cell states. Here we are going to consider CTCF from the 2014 study.

Datasets before and after differentiation (labeled G1E and ER4 respectively) should be preloaded on your laptops at /Users/cmdb/data/rawdata/g1e.tar.xz. If you need to get a new copy, you can download the data as follows:

wget https://bx.bio.jhu.edu/data/msauria/cmdb-lab/g1e.tar.xz

Next, unpack the datasets.

tar xzf g1e.tar.xz

You should now have 4 files:

CTCF_ER4.fastq
CTCF_G1E.fastq
input_ER4.fastq
input_G1E.fastq

These datasets include only data for chromosome 19, with both treatment and control data for each cell state.

Mapping reads

First, we need to map our reads back to the genome. I suggest using the bowtie2 aligner for this purpose. You already have the mouse genome indexed for bowtie, bowtie2, and bwa on your laptop in /Users/cmdb/data/indexes. Map each set of reads against the reference genome.

Note: Are these paired or unpaired .fastq files? Make sure to align each individual sample separately to the reference genome.

Calling peaks

We will use macs2 to call peaks which is preloaded on your computer. macs2 has several modes of operation, but the one we will use is called callpeak.

Remember that you do not need to provide all of the optional parameters. For macs2, we definitely want to provide the target and control (input) samples. It is also typically important to provide a parameter for the effective genome size. Otherwise, the defaults should work reasonably well here. If the enrichment by our antibody was poor, we might need to consider the mfold and extsize parameters (but more likely, back to the lab).

macs2 will write several output files to a directory that you specify. The file containing the peaks has the extension narrowPeak. This is an extended BED format, where the first six fields are standard BED.

Run macs2 to produce a list of peaks for each condition in BED format.

Differential binding

Identify all locations where either CTCF binding is lost or CTCF binding is gained during differentiation. Write the result in BED format. This can be done quite easily using the program bedtools, which is useful for manipulating and overlapping BED files.

Feature overlapping

The file Mus_musculus.GRCm38.94_features.bed is derived from the 2011 study and contains a partitioning of the mouse genome into functional regions based on Ensembl gene annotations (exon, intron, promoter, etc).

Count how many of the CTCF binding sites in G1E and ER4 overlap with each feature.

Plot

Produce a two-panel plot containing two bar plots. The left panel should plot the number of CTCF binding sites in each type of region (exon, intron…) for each cell type. The right panel should plot the number of sites lost and gained during differentiation for each cell type.

Hint: We recommend using command line tools to prepare your bedtools output for plotting. This way, the only Python coding you need to do is directly reading in the files and generating the plots.

Submit

Record of command line commands for this section
Two-panel plot
Plotting script/notebook
4 .bed files: ER4 peaks, G1E peaks, sites lost during differentiation (G1E only), and sites gained during differentiation (ER4 only)

Part 2: Motif discovery

We will now attempt to discover the sequence that CTCF prefers to bind in these cells. For this we will use MEME, which is a suite of many programs. Again, this should already be installed on your computer. For this specific analysis, we will use the program meme-chip.

Data

You should use your ER4 .narrowPeak file for this section.

Also, we want to be able to compare against known motifs. The MEME motif databases have already been loaded on your computer in /Users/cmdb/data/meme_db/motifs/motif_databases.12.21.tgz.

Motif finding

Motif finding is computationally intensive, so we want to enrich for signal as much as possible. I would suggest passing only the 100 strongest ChIP-seq peaks to MEME (you can pass more, but it will take more time, and passing weak peaks will make the motif more difficult to find). You will need to get the sequences corresponding to the peaks you have found. The mouse genome FASTA file is already on your computer at /Users/cmdb/data/genomes/mm10.fa.

Hint: Remember that you can use samtools to extract sequence from a FASTA file.

Using meme-chip, perform motif finding in the strongest 100 peaks from the ER4 state. Consider motif widths up to 20bp (-maxw).
Scan these motifs against the (JASPAR CORE 2018)[http://jaspar2018.genereg.net/downloads/] database using tomtom (part if the meme suite) to find matches to known motifs. You will need to unpack the database first. It is lots of files, so put it in a folder first. Also, when running tomtom, you can (and should) specify multiple motifs to compare against. You can select all of the JASPAR motifs with MA*.

wget http://jaspar2018.genereg.net/download/CORE/JASPAR2018_CORE_non-redundant_pfms_meme.zip
unzip JASPAR2018_CORE_non-redundant_pfms_meme.zip

Examine the tomtom results file (the .html file) in your web browser. Identify the best matching motif to a motif found by you using meme. Save the logo comparison.

Submit

Record of command line commands for this section
Sequence logo plot

Advanced exercises

Produce a plot showing where motif matches occur in the input sequences. This should be a density plot, where the x-axis is the relative position in the sequences where the motif matches are found, and the y-axis is a representation of how often we see motifs at that position.