For this lab, you will be exploring Promoter-Capture Hi-C (pCHiC) data from the human cell line GM12878. This cell line is derived from B-cells and has been highly studied, meaning that there are many datasets available for it. All of the data for this assignment are mapped to the hg19 build of the human genome.
In this assignment, you will be running the software Chicago to analyze interaction counts to find significant promoter interactions. After calling significant interaction peaks in pCHiC data, you’ll visualize these interactions using the UCSC genome browser. You’ll also be using additional genome annotations from the GM12878 cell line to identify potential regulatory elements for relevant genes in this cell line.
To get the data, download the tarball from here and move it into your submission folder for this assignment. Then unpack it with the command
tar -xzf pchic_data.tar.gz
After unpacking, you should have an R script called runChicago.R
and a folder called raw
with three sub-folders, Design
, Features
, and PCHIC_data
.
There’s a lot of data in the raw
directory, but there are a few specific files that you will need for the assignment that we’ll describe here:
raw/Design/h19_chr20and21.rmap
contains the genomic coordinates and fragment number (just a label for the fragment) for each restriction digest fragment in your experiment.raw/Design/h19_chr20and21.baitmap
contains the genomic coordinates, fragment number, and gene name for each promoter bait fragment. This is essentially a subset of the h19_chr20and21.rmap
file, containing only those restriction digest fragments for which you had a corresponding bait in your pCHiC experiment.raw/PCHIC_Data/
contains the raw number of reads for each pair of fragments across 3 replicates. This file was generated by aligning the raw reads from the pCHiC experiment to the reference, and counting the number of reads aligning to each restriction digest fragment. chicago
will use this raw data to identify significant interactions.raw/Features/
contains a set of .narrowPeak
files from previous ChIP-seq experiments in this cell line. There are .narrowPeak
files for several different interesting regulatory features like CTCF and H3K4me3. You’ll be intersecting your pCHiC results with these regulatory features to learn more about how these features regulate long-range interactions. The raw/Features/featuresGM.txt
file lists the name of each regulatory feature and its corresponding .narrowPeak
file, and will be needed when running chicago
.You will need to use your Rosetta x86 version of iTerm for this assignment. Open Rosetta_iTerm.app
and run the following command to create and activate your conda environment:
mamba create -n chicago bioconductor-chicago r-argparser matplotlib
mamba activate chicago
You should now see the environment indicator (chicago)
to the left of your terminal prompt. Remember when you’re finished, to get out of this conda environment you can close the window or run the command mamba deactivate
.
Chicago
The data have already been aligned and assigned to specific restriction fragments. The next step is to determine which fragment pairs are co-occurring at a significantly higher frequency than expected. To do this, you will use the R package Chicago
. One of the things you downloaded was an R script called runChicago.R
, which is an executable version of the package. You can run the script using the command Rscript runChicago.R
.
runChicago.R
scriptTo get an idea of the parameters runChicago.R
is expecting, you can run Rscript runChicago.R --help
.
You’ll note that there are two required positional parameters:
input-files
: This should be the filepaths of read count files for each of your three replicate experiments in the raw/PCHIC_Data/
directory.output-prefix
: This should be the name of the output directory chicago
will create and write the output to.In addition to the two required parameters, there are three other arguments you’ll need to specify:
--design-dir
: This should be the path to the design directory that describes all of the restriction fragments in your experiment--en-feat-list
: This should be a file describing the filepaths of each of the regulatory features you want to intersect your pCHiC data with--export-format
: This should be the desired format of the output interactions. There are a bunch of different possible output formats, but the washU_text
format is the one that will probably be most useful to you.Run chicago
using these parameters. Record the command you used to run chicago
in your README.md
file; you’ll be submitting your README.md
as part of the assignment.
Once you’ve run this command you should have a folder with the output prefix you selected containing a variety of output files, which you’ll explore more now.
Inside the output folder there is a subfolder named enrichment_data
. This folder has two files (a .pdf
figure file and a .txt
file) that describe the enrichment of pCHiC interactions within the regulatory features you gave chicago
as input (the .narrowPeak
files).
The .txt
file has one row for each regulatory feature, and six columns. We’ll describe the first three columns here:
OLwithSI
column. You don’t need to worry about how chicago
determines “significant” interactions.Examine the feature overlaps and answer the following question in your README.md
: Do these enrichments make sense to you? Are any surprising? Explain your reasoning briefly for each feature.
The UCSC broswer is a great tool for visualzing interactions from pCHiC experiments such as these. Unfortunately, while chicago
does output a file of significant interactions, it’s in a format specific for the WashU genome browser, which we will not be using. You’ll need to convert the chicago
WashU format to a format the UCSC browser can use.
First, take a look at the chicago
output file, which should be <OUTNAME>/data/<OUTNAME>_washU_text.txt
where <OUTNAME>
is the output prefix you used when running Chicago
. You’ll see that this file has one line for each interaction and 3 columns:
Now, take a look at the format you will need to visualize the interactions in the UCSC browser here. You’ll note that this is essentially an extension of the .bed
format.
If this description of the necessary fields is confusing, you can use the hint below to check your understanding:
1. chrom: The chromsome of the interaction 2. chromStart: The start position of the interaction (i.e. the start position of the lower of the two fragments) 3. chromEnd: The end position of the interaction (i.e. the end position of the upper of the two fragments) 4. name: You won't need this so you can just use a `.` to mark that it's missing 5. score: An integer score (0-1000) that describes the strength of the interaction. This is set to help with visualiztion. We'll describe below how to generate this. 6. value: The strength of the interaction 7. exp: You won't need this so you can just use a `.` to mark that it's missing 8. color: Feel free to set a different color, but you can just use `0` to show the interactions in black 9. sourceChrom: The chromosome of the bait fragment 10. sourceStart: The start position of the bait fragment 11. sourceEnd: The end position of the bait fragment 12. sourceName: The name of the bait fragment (i.e. the name of the gene(s) for which the bait fragment is a marker) 13. sourceStrand: The "strand" of the bait fragment. You don't need this, but we recommend setting it to `+` to indicate that this is a bait fragment 14. targetChrom: The chromosome of the target fragment 15. targetStart: The start position of the target fragment 16. targetEnd: The end position of the target fragment 17. targetName: The name of the target fragment. If the target fragment is *also* a bait fragment, this should be the name of the fragment, otherwise you can mark it as empty with a `.` 18. targetStrand: The "strand" of the target fragment. Again, you don't need this, but we recommend setting it to `+` if the target is also a bait fragment, and `-` if it is not. It will help later on.
Now that you understand the two file formats, you’ll have to write a Python script that will convert the WashU format to the UCSC interaction bed format.
Create an empty convert_washU_to_UCSC.py
script. You’ll be submitting this script as part of the assignment.
This script should take three inputs:
.baitmap
file that describes each of the bait fragments<OUTNAME>_washU_text.txt
file with WashU format interactionsYour script should do the following:
track type=interact name="pCHIC" description="Chromatin interactions" useScore=on maxHeightPixels=200:100:50 visibility=full
.baitmap
file to determine, for each interaction, which fragment is the bait (or if they are both baits) and what the corresponding genes aremax
interaction strength, and multiplying by 1000. This should be an integer.
score = int(streng / max_strength) * 1000
Using your newly created UCSC interaction bed file, find the following:
README.md
file for this assignment.README.md
file for this assignment. Make sure it’s clear which set of 6 is which in your README.md
You can load a session in the UCSC genome browser with this link. It has a variety of annotation tracks for GM12878 already loaded. In order to add your custom track, use the button “add custom tracks” under the browser display and then above the “Paste URLs or data” field, use the button “Choose File” to select the interaction bed file you created.
Select at least two of the top 6 promoter-enhancer interactions you identified in Step 2.2 and navigate to the associated gene in the UCSC genome browser. You may want to click on your custom track in the track selection section and set the minimum score to somewhere between 300-500 to reduce the number of interactions shown. Find likely enhancer-promoter interactions based on the interaction track and regulatory feature annotations.
For each of the two chosen interactions, do the following:
In addition to finding likely enhancers for your target gene, look it up on the NCBI website. You may need to click on the gene link once you search to get all of the information. Specifically, look at the distribution of expression across tissues.
In your README.md
for this assignment, answer the following question: Does it make sense for this gene to be interacting with enhancers in GM12878? Explain.
README.md
file with commands and answers to questions (5 points total)
chicago
in step 1.1 (1 point)convert_washU_to_UCSC.py
script from step 2.1 (3 points)convert_washU_to_UCSC.py
(1 point)
git add --force
to get around this. Make sure you actually add it to your commit.Total Points: 10
Write a script that takes in the raw data and significant interaction file as well as a target gene name and plots all of the interactions for that gene as a scatterplot of distance from fragment centers between the promoter bait and other fragment by the number of reads. Color the points by the significance, black to red, for lowest to highest, respectively. For interactions not in the significant interaction file, give them a score of zero. You can limit the x-axis range to +/- 1Mb. Plot interaction profiles for at least two genes in your top 6 lists.