Assignment Date: Friday, Oct. 8, 2021
Due Date: Friday, Oct. 15, 2021 @ 1pm ET
Live-coding jupyter notebook: wget https://github.com/bxlab/qbb2021/raw/main/week5/ChIPseq%20and%20Motif%20finding%20-%20Live.ipynb
Live-coding markdown: wget https://github.com/bxlab/qbb2021/raw/main/week5/ChIPseq_live_coding.md
For this lab, we are going to use some datasets from the 2014 study, “A comparative encyclopedia of DNA elements in the mouse genome” and the 2011 study, “Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration”.
In both of these works, the authors use a cell line called G1E-ER4 derived from Gata1-null mouse ES cells and expressing an estrogen-activated Gata1-estrogen receptor (ER) transgene. Thus, these cells undergo erythroid differentiation when treated with estradiol.
A number of transcription factors and histone modifications have been assayed for both cell states. Here we are going to consider CTCF from the 2014 study.
Datasets before and after differentiation (labeled G1E and ER4 respectively) should be preloaded on your laptops at /Users/cmdb/data/rawdata/g1e.tar.xz
. If you need to get a new copy, you can download the data as follows:
wget https://bx.bio.jhu.edu/data/msauria/cmdb-lab/g1e.tar.xz
Next, unpack the datasets.
tar xzf g1e.tar.xz
You should now have 4 files:
These datasets include only data for chromosome 19, with both treatment and control data for each cell state.
First, we need to map our reads back to the genome. I suggest using the bowtie2
aligner for this purpose. You already have the mouse genome indexed for bowtie, bowtie2, and bwa on your laptop in /Users/cmdb/data/indexes
. Map each set of reads against the reference genome.
Note: Are these paired or unpaired .fastq
files? Make sure to align each individual sample separately to the reference genome.
We will use macs2
to call peaks which is preloaded on your computer. macs2
has several modes of operation, but the one we will use is called callpeak
.
Remember that you do not need to provide all of the optional parameters. For macs2
, we definitely want to provide the target and control (input) samples. It is also typically important to provide a parameter for the effective genome size. Otherwise, the defaults should work reasonably well here. If the enrichment by our antibody was poor, we might need to consider the mfold
and extsize
parameters (but more likely, back to the lab).
macs2
will write several output files to a directory that you specify. The file containing the peaks has the extension narrowPeak
. This is an extended BED format, where the first six fields are standard BED.
macs2
to produce a list of peaks for each condition in BED format.bedtools
, which is useful for manipulating and overlapping BED files.The file Mus_musculus.GRCm38.94_features.bed is derived from the 2011 study and contains a partitioning of the mouse genome into functional regions based on Ensembl gene annotations (exon, intron, promoter, etc).
Hint: We recommend using command line tools to prepare your bedtools output for plotting. This way, the only Python coding you need to do is directly reading in the files and generating the plots.
.bed
files: ER4 peaks, G1E peaks, sites lost during differentiation (G1E only), and sites gained during differentiation (ER4 only)We will now attempt to discover the sequence that CTCF prefers to bind in these cells. For this we will use MEME, which is a suite of many programs. Again, this should already be installed on your computer. For this specific analysis, we will use the program meme-chip
.
You should use your ER4 .narrowPeak
file for this section.
Also, we want to be able to compare against known motifs. The MEME motif databases have already been loaded on your computer in /Users/cmdb/data/meme_db/motifs/motif_databases.12.21.tgz
.
Motif finding is computationally intensive, so we want to enrich for signal as much as possible. I would suggest passing only the 100 strongest ChIP-seq peaks to MEME (you can pass more, but it will take more time, and passing weak peaks will make the motif more difficult to find). You will need to get the sequences corresponding to the peaks you have found. The mouse genome FASTA file is already on your computer at /Users/cmdb/data/genomes/mm10.fa
.
samtools
to extract sequence from a FASTA file.meme-chip
, perform motif finding in the strongest 100 peaks from the ER4 state. Consider motif widths up to 20bp (-maxw).tomtom
(part if the meme suite) to find matches to known motifs. You will need to unpack the database first. It is lots of files, so put it in a folder first. Also, when running tomtom
, you can (and should) specify multiple motifs to compare against. You can select all of the JASPAR motifs with MA*
.wget http://jaspar2018.genereg.net/download/CORE/JASPAR2018_CORE_non-redundant_pfms_meme.zip
unzip JASPAR2018_CORE_non-redundant_pfms_meme.zip
tomtom
results file (the .html file) in your web browser. Identify the best matching motif to a motif found by you using meme
. Save the logo comparison.