The 3D Genome

Assignment Overview

For this lab, you will be exploring Promoter-Capture Hi-C (pCHiC) data from the human cell line GM12878. This cell line is derived from B-cells and has been highly studied, meaning that there are many datasets available for it. All of the data for this assignment are mapped to the hg19 build of the human genome.

In this assignment, you will be running the software Chicago to analyze interaction counts to find significant promoter interactions. After calling significant interaction peaks in pCHiC data, you’ll visualize these interactions using the UCSC genome browser. You’ll also be using additional genome annotations from the GM12878 cell line to identify potential regulatory elements for relevant genes in this cell line.

Data

To get the data, download the tarball from here and move it into your submission folder for this assignment. Then unpack it with the command

tar -xzf pchic_data.tar.gz

After unpacking, you should have an R script called runChicago.R and a folder called raw with three sub-folders, Design, Features, and PCHIC_data.

There’s a lot of data in the raw directory, but there are a few specific files that you will need for the assignment that we’ll describe here:

raw/Design/h19_chr20and21.rmap contains the genomic coordinates and fragment number (just a label for the fragment) for each restriction digest fragment in your experiment.
raw/Design/h19_chr20and21.baitmap contains the genomic coordinates, fragment number, and gene name for each promoter bait fragment. This is essentially a subset of the h19_chr20and21.rmap file, containing only those restriction digest fragments for which you had a corresponding bait in your pCHiC experiment.
raw/PCHIC_Data/ contains the raw number of reads for each pair of fragments across 3 replicates. This file was generated by aligning the raw reads from the pCHiC experiment to the reference, and counting the number of reads aligning to each restriction digest fragment. chicago will use this raw data to identify significant interactions.
raw/Features/ contains a set of .narrowPeak files from previous ChIP-seq experiments in this cell line. There are .narrowPeak files for several different interesting regulatory features like CTCF and H3K4me3. You’ll be intersecting your pCHiC results with these regulatory features to learn more about how these features regulate long-range interactions. The raw/Features/featuresGM.txt file lists the name of each regulatory feature and its corresponding .narrowPeak file, and will be needed when running chicago.

Setting up your analysis environment

You will need to use your Rosetta x86 version of iTerm for this assignment. Open Rosetta_iTerm.app and run the following command to create and activate your conda environment:

mamba create -n chicago bioconductor-chicago r-argparser matplotlib
mamba activate chicago

You should now see the environment indicator (chicago) to the left of your terminal prompt. Remember when you’re finished, to get out of this conda environment you can close the window or run the command mamba deactivate.

Exercises

Exercise 1: Running `Chicago`

The data have already been aligned and assigned to specific restriction fragments. The next step is to determine which fragment pairs are co-occurring at a significantly higher frequency than expected. To do this, you will use the R package Chicago. One of the things you downloaded was an R script called runChicago.R, which is an executable version of the package. You can run the script using the command Rscript runChicago.R.

Step 1.1: Running the `runChicago.R` script

To get an idea of the parameters runChicago.R is expecting, you can run Rscript runChicago.R --help.

You’ll note that there are two required positional parameters:

input-files: This should be the filepaths of read count files for each of your three replicate experiments in the raw/PCHIC_Data/ directory.
output-prefix: This should be the name of the output directory chicago will create and write the output to.

In addition to the two required parameters, there are three other arguments you’ll need to specify:

--design-dir: This should be the path to the design directory that describes all of the restriction fragments in your experiment
--en-feat-list: This should be a file describing the filepaths of each of the regulatory features you want to intersect your pCHiC data with
--export-format: This should be the desired format of the output interactions. There are a bunch of different possible output formats, but the washU_text format is the one that will probably be most useful to you.

Run chicago using these parameters. Record the command you used to run chicago in your README.md file; you’ll be submitting your README.md as part of the assignment.

Once you’ve run this command you should have a folder with the output prefix you selected containing a variety of output files, which you’ll explore more now.

Step 1.2: Feature enrichment

Inside the output folder there is a subfolder named enrichment_data. This folder has two files (a .pdf figure file and a .txt file) that describe the enrichment of pCHiC interactions within the regulatory features you gave chicago as input (the .narrowPeak files).

The .txt file has one row for each regulatory feature, and six columns. We’ll describe the first three columns here:

The name of the regulatory feature. This column does not have a column header.
The number of “target” fragments (i.e. the non-bait fragment from an interaction) from significant interactions that overlap that feature. This is the OLwithSI column. You don’t need to worry about how chicago determines “significant” interactions.
The mean number of random fragments that overlap that feature. This mean is generated by 1) determining the number of “significant” interactions, 2) selecting that many random fragments, 3) calculating the number that overlap the feature, and 4) repeating steps 2 and 3 multiple times and taking the mean number of overlaps. This essentially gives you an idea of the number of overlaps expected by chance.

Examine the feature overlaps and answer the following question in your README.md: Do these enrichments make sense to you? Are any surprising? Explain your reasoning briefly for each feature.

Exercise 2: Visualzing interactions on the UCSC browser

The UCSC broswer is a great tool for visualzing interactions from pCHiC experiments such as these. Unfortunately, while chicago does output a file of significant interactions, it’s in a format specific for the WashU genome browser, which we will not be using. You’ll need to convert the chicago WashU format to a format the UCSC browser can use.

First, take a look at the chicago output file, which should be <OUTNAME>/data/<OUTNAME>_washU_text.txt where <OUTNAME> is the output prefix you used when running Chicago. You’ll see that this file has one line for each interaction and 3 columns:

A comma-separated list denoting the location of one of the fragments of the interaction. This could be either the bait or the target fragment
A comma-eparated list denoting the location of the other fragment of the interaction. Again, this could be either the bait or the target fragment
The “strength” of the interaction, which is based on the number of reads from your pCHiC experiment supporting that interaction

Now, take a look at the format you will need to visualize the interactions in the UCSC browser here. You’ll note that this is essentially an extension of the .bed format.

If this description of the necessary fields is confusing, you can use the hint below to check your understanding:

CLICK HERE FOR A DESCRIPTION OF THE FIELDS

1. chrom: The chromsome of the interaction
2. chromStart: The start position of the interaction (i.e. the start position of the lower of the two fragments)
3. chromEnd: The end position of the interaction (i.e. the end position of the upper of the two fragments)
4. name: You won't need this so you can just use a `.` to mark that it's missing
5. score: An integer score (0-1000) that describes the strength of the interaction. This is set to help with visualiztion. We'll describe below how to generate this.
6. value: The strength of the interaction
7. exp: You won't need this so you can just use a `.` to mark that it's missing
8. color: Feel free to set a different color, but you can just use `0` to show the interactions in black
9. sourceChrom: The chromosome of the bait fragment
10. sourceStart: The start position of the bait fragment
11. sourceEnd: The end position of the bait fragment
12. sourceName: The name of the bait fragment (i.e. the name of the gene(s) for which the bait fragment is a marker)
13. sourceStrand: The "strand" of the bait fragment. You don't need this, but we recommend setting it to `+` to indicate that this is a bait fragment
14. targetChrom: The chromosome of the target fragment
15. targetStart: The start position of the target fragment
16. targetEnd: The end position of the target fragment
17. targetName: The name of the target fragment. If the target fragment is *also* a bait fragment, this should be the name of the fragment, otherwise you can mark it as empty with a `.`
18. targetStrand: The "strand" of the target fragment. Again, you don't need this, but we recommend setting it to `+` if the target is also a bait fragment, and `-` if it is not. It will help later on.

Now that you understand the two file formats, you’ll have to write a Python script that will convert the WashU format to the UCSC interaction bed format.

Step 2.1: Convert WashU to UCSC

Create an empty convert_washU_to_UCSC.py script. You’ll be submitting this script as part of the assignment.

This script should take three inputs:

The filepath to the .baitmap file that describes each of the bait fragments
The filepath to the <OUTNAME>_washU_text.txt file with WashU format interactions
A filepath for an output interaction bed format file

Your script should do the following:

To make sure your output file properly displays, add the following line at the top of your output:
- track type=interact name="pCHIC" description="Chromatin interactions" useScore=on maxHeightPixels=200:100:50 visibility=full
- Note that you do NOT need to have the column names as an additional header
Use the .baitmap file to determine, for each interaction, which fragment is the bait (or if they are both baits) and what the corresponding genes are
For each interaction, determine an interaction “score” (used for visualization), by dividing the interaction strength by the max interaction strength, and multiplying by 1000. This should be an integer.
- In pseudocode, score = int(streng / max_strength) * 1000
Using the information from steps 2 and 3, reformat the input WashU-formatted data to the UCSC interaction bed format

Step 2.2: Finding the most significant interactions

Using your newly created UCSC interaction bed file, find the following:

The 6 top-scoring interactions between two promoters (i.e. between two bait fragments). Record these six lines in your README.md file for this assignment.
The 6 top-scoring interactions between a promoter and an enhancer (i.e. between one bait fragment and and non-bait fragments). Record these six lines in your README.md file for this assignment. Make sure it’s clear which set of 6 is which in your README.md

Step 2.3: Visualizing top interactions

You can load a session in the UCSC genome browser with this link. It has a variety of annotation tracks for GM12878 already loaded. In order to add your custom track, use the button “add custom tracks” under the browser display and then above the “Paste URLs or data” field, use the button “Choose File” to select the interaction bed file you created.

Select at least two of the top 6 promoter-enhancer interactions you identified in Step 2.2 and navigate to the associated gene in the UCSC genome browser. You may want to click on your custom track in the track selection section and set the minimum score to somewhere between 300-500 to reduce the number of interactions shown. Find likely enhancer-promoter interactions based on the interaction track and regulatory feature annotations.

For each of the two chosen interactions, do the following:

Focus the browser range to only include both ends of these interactions
Save an image (if you two-finger click on the browser window, a menu pops up with the option “View image”. This will open a new tab which you can two-finger click and select “Save Image As…”). You will submit this image for the assignment.

In addition to finding likely enhancers for your target gene, look it up on the NCBI website. You may need to click on the gene link once you search to get all of the information. Specifically, look at the distribution of expression across tissues.

In your README.md for this assignment, answer the following question: Does it make sense for this gene to be interacting with enhancers in GM12878? Explain.

Submission

README.md file with commands and answers to questions (5 points total)
- Command to run chicago in step 1.1 (1 point)
- Answer to question in step 1.2 (1 point)
- List of top 6 promoter-promoter interactions from step 2.2 (1 point)
- List of top 6 promoter-enhancer interactions from step 2.2 (1 point)
- Answer to question in step 2.3 (1 point)
Your convert_washU_to_UCSC.py script from step 2.1 (3 points)
The output interaction bed file from running convert_washU_to_UCSC.py (1 point)
- git will complain about adding this since it is a bed file which is in your .gitignore list. You can use git add --force to get around this. Make sure you actually add it to your commit.
The two UCSC genome browser screen shots of your selected promoter-enhancer interactions (1 point)

Total Points: 10

Advanced Exercise 1: Plotting top interactions (OPTIONAL)

Write a script that takes in the raw data and significant interaction file as well as a target gene name and plots all of the interactions for that gene as a scatterplot of distance from fragment centers between the promoter bait and other fragment by the number of reads. Color the points by the significance, black to red, for lowest to highest, respectively. For interactions not in the significant interaction file, give them a score of zero. You can limit the x-axis range to +/- 1Mb. Plot interaction profiles for at least two genes in your top 6 lists.