Assignment 6: 3D Genome

Assignment Date: Friday, Oct. 15, 2021
Due Date: Friday, Oct. 22, 2021 @ 1pm ET

Lecture

Live-coding blank notebook: wget https://github.com/bxlab/qbb2021/raw/main/week6/3D%20Genome%20Live.ipynb

Live-coding master notebook: wget https://github.com/bxlab/qbb2021/raw/main/week6/3D%20Genome%20Live%20-%20Master.ipynb

3D Genome

The goal of today’s lab is to learn how to analyze high-throughput chromosome conformation capture data. We will be using data from the ENCODE project to look at nuclear compartments and seeing how compartmentalization corresponds to gene expression and H3K27me3. The Hi-C data has already been mapped and normalized. The next step is to call compartments using the tool suite cooltools.

Installing CoolTools

To install CoolTools, we will be creating a self-contained environment using Conda. This means that we don’t have to worry about what else is already installed, it won’t interfere. To do this, we need to create the environment and then populate it with CoolTool and its dependencies:

conda create -n hic cooltools cooler matplotlib numpy bedtools ucsc-bigWigToBedGraph jupyter
conda activate hic

Now we can use the cooltools python library for writing scripts for manipulating Hi-C data.

Getting data

All of the data we will be using is located here in a tarball. Download it. To extract it, change to the directory you want to work in and run the command tar xzf 3Dgenome_encode.tar.gz. You should have 4 files:

K562_hg19_chr3_50K.cool
K562_hg19_FPKM_chr3.bed
K562_hg19_H3K27me3_chr3.bw
hg19_GC_chr3_50K.txt

The first thing to do is convert the bigwig file to a bedgraph file with bigWigToBedGraph K562_hg19_H3K27me3_chr3.bw K562_hg19_H3K27me3_chr3.bg. The .cool file contains chromatin interaction data, binned in 50Kb bins, along with normalization data. The FPKM file contains gene expression levels, the H3K27me3 bigwig file contains ChIP-seq normalized signal, and the GC file contains GC content data at 50Kb resolution.

Compartment analysis

Next, you will be using cooltools to do compartment calling. To do this, you will need to write a python script and load the cooltools module. Specifically, we will be using the cis_eig function.

from cooltools.eigdecomp import cis_eig

This function requires three arguments; a matrix of interaction data, the number of eigen vectors to calculate, and a signal track to orient the eigen vector data (it is random which sign the eigen vector tracks have). The matrix can be gotten from the .cool file, while the signal track is from the GC file. In order to extract a matrix of interaction values from the cool file you need to load it and then call for a matrix as follows:

cool = cooler.Cooler('K562_hg19_chr3_50K.cool')
mat = cool.matrix(balance=True)[:]

To determine nuclear compartments, you should use the cis_eig function. You can learn more in cooltools’ documentation. The first eigen vector indicates the nuclear compartment, with positive scores correspoding to the ‘A’ compartment and negative scores correspond to the ‘B’ compartment.

Plot the first eigen vector, along with the interaction matrix such that you can see the relationship between eigen vector scores and compartment patterns in the Hi-C data (compartment scores over heatmap). You may need to play with the vmax argument in the matplotlib matshow function to get a good dynamic range of the data (this puts a cap on the maximum score displayed). I used ten times the median score of all of the nonzero interactions, but you can play around with this number if you like.

Hint: You can look at the examples in the documentation for ways to nicely combine the two plots.

Create a bed file from the first eigen vector with each ‘A’ and ‘B’ region as determined by the sign of the eigen vector score.

Expression vs. Repression

Finally, let’s look at how repression, as inferred by H3K27me3, differs by compartment and expression for chromosome 3. The first step is to classify genes into two catergories, ‘on’ and ‘off’. To do this, create a histogram of gene expressions from the FPKM file, transforming the expression values using log2(FPKM + 1). You should see a steep drop-off from zero, followed by a hump. Approximate the low point between these two as the cutoff for genes that are off vs. being expressed. Next, you need to find the mean H3K27me3 level for each gene body and which compartment each gene is in. You can use bedtools for both using the map subcommand. For finding the H3K27me3 signal, you will need to adjust the score to reflect the fact that the intervals are not uniform. To do this, add another column to the bedgraph file with the following command in the terminal:

awk 'BEGIN{OFS="\t"}{$5=($3-$2)*$4; print $1,$2,$3,$4,$5}' K562_hg19_H3K27me3_chr3.bg > K562_hg19_H3K27me3_chr3_norm.bg

Now when you run bedtools map you can select the fifth column for the adjusted H3K27me3 score. However, you will still need to divide the summed scores you find for each gene by the length of the gene body. You can do this with a similar command:

awk 'BEGIN{OFS="\t"}{$7=$7 / ($3-$2); print $1,$2,$3,$4,$5,$6,$7}' K562_hg19_FPKM_chr3_map.bed > K562_hg19_FPKM_chr3_mapnorm.bed

You will also need to mark which compartment each gene resides in. To do this, use the map function, but indicate that the gene must overlap with a compartment by at least 50% and the operation should be “distinct” (the -o flag). You should now have a file that contains a normalized transformed H3K27me3 signal and compartment assignment for each gene.

Make a violin plot containing four categories:

A-compartment expressed
A-compartment not expressed
B-compartment expressed
B-compartment not expressed

Make sure to clearly label your X-axis.

Submit

Scripts (including python and command line) used to carry out the analysis
Chromosome 3 heatmap with compartment scores
Compartment bed file for chromosome 3
Gene expression histogram including a line indicating the cutoff you have selected
Violin plot with genes groups by compartment and expression status

Advanced Exercises

Add lines to your heatmap and compartment score plots indicating compartment boundaries
Create a “saddle plot”. This means stratifying the compartment scores into some number of equally sized ranges of scores and then calculating the mean interaction value between each of the pairwise combinations of score ranges from the interaction matrix (ignore the diagonal interactions, e.g. row 1 by columm 1). Essentially, each row and column would get assigned a compartment index based on which stratified bin its compartment score fell in. Then, for each row (index n) and column (index m), add its interaction score to the appropriate cell in the saddle matrix (n, m) and finally divide the saddle matrix cells by the total number of row x column combinations that were assigned to that cell. For this exercise, break the compartment score into 10 bins.