Assignment Date: Friday, Oct. 15, 2021
Due Date: Friday, Oct. 22, 2021 @ 1pm ET
Live-coding blank notebook: wget https://github.com/bxlab/qbb2021/raw/main/week6/3D%20Genome%20Live.ipynb
Live-coding master notebook: wget https://github.com/bxlab/qbb2021/raw/main/week6/3D%20Genome%20Live%20-%20Master.ipynb
The goal of today’s lab is to learn how to analyze high-throughput chromosome conformation capture data. We will be using data from the ENCODE project to look at nuclear compartments and seeing how compartmentalization corresponds to gene expression and H3K27me3. The Hi-C data has already been mapped and normalized. The next step is to call compartments using the tool suite cooltools
.
To install CoolTools, we will be creating a self-contained environment using Conda. This means that we don’t have to worry about what else is already installed, it won’t interfere. To do this, we need to create the environment and then populate it with CoolTool and its dependencies:
conda create -n hic cooltools cooler matplotlib numpy bedtools ucsc-bigWigToBedGraph jupyter
conda activate hic
Now we can use the cooltools python library for writing scripts for manipulating Hi-C data.
All of the data we will be using is located here in a tarball. Download it. To extract it, change to the directory you want to work in and run the command tar xzf 3Dgenome_encode.tar.gz
. You should have 4 files:
The first thing to do is convert the bigwig file to a bedgraph file with bigWigToBedGraph K562_hg19_H3K27me3_chr3.bw K562_hg19_H3K27me3_chr3.bg
. The .cool
file contains chromatin interaction data, binned in 50Kb bins, along with normalization data. The FPKM file contains gene expression levels, the H3K27me3 bigwig file contains ChIP-seq normalized signal, and the GC file contains GC content data at 50Kb resolution.
Next, you will be using cooltools
to do compartment calling. To do this, you will need to write a python script and load the cooltools module. Specifically, we will be using the cis_eig
function.
from cooltools.eigdecomp import cis_eig
This function requires three arguments; a matrix of interaction data, the number of eigen vectors to calculate, and a signal track to orient the eigen vector data (it is random which sign the eigen vector tracks have). The matrix can be gotten from the .cool
file, while the signal track is from the GC
file. In order to extract a matrix of interaction values from the cool file you need to load it and then call for a matrix as follows:
cool = cooler.Cooler('K562_hg19_chr3_50K.cool')
mat = cool.matrix(balance=True)[:]
To determine nuclear compartments, you should use the cis_eig
function. You can learn more in cooltools’ documentation. The first eigen vector indicates the nuclear compartment, with positive scores correspoding to the ‘A’ compartment and negative scores correspond to the ‘B’ compartment.
Plot the first eigen vector, along with the interaction matrix such that you can see the relationship between eigen vector scores and compartment patterns in the Hi-C data (compartment scores over heatmap). You may need to play with the vmax
argument in the matplotlib matshow
function to get a good dynamic range of the data (this puts a cap on the maximum score displayed). I used ten times the median score of all of the nonzero interactions, but you can play around with this number if you like.
Create a bed file from the first eigen vector with each ‘A’ and ‘B’ region as determined by the sign of the eigen vector score.
Finally, let’s look at how repression, as inferred by H3K27me3, differs by compartment and expression for chromosome 3. The first step is to classify genes into two catergories, ‘on’ and ‘off’. To do this, create a histogram of gene expressions from the FPKM
file, transforming the expression values using log2(FPKM + 1)
. You should see a steep drop-off from zero, followed by a hump. Approximate the low point between these two as the cutoff for genes that are off vs. being expressed. Next, you need to find the mean H3K27me3 level for each gene body and which compartment each gene is in. You can use bedtools for both using the map
subcommand. For finding the H3K27me3 signal, you will need to adjust the score to reflect the fact that the intervals are not uniform. To do this, add another column to the bedgraph file with the following command in the terminal:
awk 'BEGIN{OFS="\t"}{$5=($3-$2)*$4; print $1,$2,$3,$4,$5}' K562_hg19_H3K27me3_chr3.bg > K562_hg19_H3K27me3_chr3_norm.bg
Now when you run bedtools map
you can select the fifth column for the adjusted H3K27me3 score. However, you will still need to divide the summed scores you find for each gene by the length of the gene body. You can do this with a similar command:
awk 'BEGIN{OFS="\t"}{$7=$7 / ($3-$2); print $1,$2,$3,$4,$5,$6,$7}' K562_hg19_FPKM_chr3_map.bed > K562_hg19_FPKM_chr3_mapnorm.bed
You will also need to mark which compartment each gene resides in. To do this, use the map function, but indicate that the gene must overlap with a compartment by at least 50% and the operation should be “distinct” (the -o
flag). You should now have a file that contains a normalized transformed H3K27me3 signal and compartment assignment for each gene.
Make a violin plot containing four categories:
Make sure to clearly label your X-axis.