Assignment Date: Friday, Oct. 1, 2021
Due Date: Friday, Oct. 8, 2021 @ 1pm ET
We will use bismark
for aligning bisulfite converted reads and igv
for visualization. These should already be present in your base conda environment. However, if you don’t have these programs, you can create a conda environment for this:
conda create -n week6 -c bioconda -c anaconda fastqc bismark samtools bowtie2 igv
Published short-read sequence data is typically made available through SRA. This is mandated for NIH funded research.
We will use data from this paper Dynamic epigenomic landscapes during early lineage specification in mouse embryos.
Look for a section called Accession Numbers (Usually in the “Data Availability” section after the discussion). Use the information you find there to locate the data in the sequence read archive.
We’ll work with two datasets, STEM-seq E4.0ICM rep1 and STEM-seq E5.5Epi rep1. However, because of the sequencing depth required for analysis of bisulfite seqsuencing data, you will be working with a subset of reads specific to a region on chromosome 6. These data should already be on your laptops in /Users/cmdb/data/rawdata
with the following files:
If you need a new copy of any of these, you download them from here:
These reads have also been trimmed to remove a poly-G adapter. You will need to determine which file belongs to which sample using GEO (look for the SRA link on each sample’s webpage). I would recommend creating an index
folder and move chr6.fa.gz
into this folder. This will help keep things organized when you create your bismark index files. Finally, you need to unpack the fastq files:
tar xzf methylation_fastq.tar.gz
You should now have four fastq datasets (two cell types, paired end reads). Inspect the files with less
to be sure. Run fastqc
on one of the fastq files (one end of one dataset). I would recommend creating a folder to hold the output and specify this with -o
. This will help keep things organized. You do not need any other optional arguments. Inspect the resulting report. Does anything stand out? Can you explain why you’re seeing anything unusual?
Bismark is a mapping tool for bisulfite treated data.
You will first need to index your reference genome. For speed, we will just use chromosome 6 from mouse genome assembly 10 (mm10). You can use bismark_genome_preparation
to create an index for mapping to this genome. This will take a few minutes.
--help
flag to get details about each of the options and requirements for this program.--parallel 7
).Next, map your two experiments separately to the genome using bismark. The result will be a .bam
file for each condition containing the mapped reads and annotations of the methylation state. I recommend using -B
for naming your output files something useful for identifying the two conditions. Next, use deduplicate_bismark
to remove duplicate reads. Finally, use IGV to visualize these bam files. You will first need to create sorted BAM files (samtools sort
) and index them (samtools index
). You can switch IGV to use mm10 by choosing load genome from server.
To extract methylation data, we will use bismark_methylation_extractor
. Run this program on each of your two deduplicated BAM files, using the --bedgraph
and --comprehensive
options. You can load the resulting bedgraph files into IGV alongside your reads. Save an image of the region chr6:51,750,000-52,750,000 including the methylation bedgraph tracks for both conditions and the refseq gene track.
The file mm10_refseq_genes_chr6_50M_60M.bed contains all of the refseq genes found in the region chr6:50,000,000-60,000,000. Currently the file is in the format described here. However, we want to specifically look at the promoter regions (which we’re defining as the 10Kb upstream of the transcription start site as CpG islands can occur several Kb away). You can extract the promoters and write them to a bed file and remove duplicates (the original file contains an entry for each transcript) and Riken theoritical genes (identifiers ending with “Rik”) with the following command:
awk 'BEGIN{OFS="\t"}{if ($4 == "+") print $3,$5 - 2000,$5,$13,$12,$4; else print $3,$6,$6 + 2000,$13,$12,$4;}' mm10_refseq_genes_chr6_50M_60M.bed | grep -v Rik | uniq -f 3 | sort -k2,2n > promoters.bed
I highly recommend looking at some examples of awk as it is an incredibly versatile and handy command line tool to know how to use.
For each promoter, find the total methylation signal in E4 to E5.5 cells (sum all methylation site scores). You need to produce 3 plots:
You should make sure your plots have informative axis labels.
You should also use a t-test to see if the mean methylation of hox gene promoters is lower than other gene promoters in E4.0 and E5.5 cells (two tests, one for each cell type). Python has a function for performing such a test in the scipy package.