Assignment 4: DNA Methylation

Assignment Date: Friday, Oct. 1, 2021
Due Date: Friday, Oct. 8, 2021 @ 1pm ET

Lecture

Today’s slides

Homework

Before you start

We will use bismark for aligning bisulfite converted reads and igv for visualization. These should already be present in your base conda environment. However, if you don’t have these programs, you can create a conda environment for this:

conda create -n week6 -c bioconda -c anaconda fastqc bismark samtools bowtie2 igv

Getting the data

Published short-read sequence data is typically made available through SRA. This is mandated for NIH funded research.

We will use data from this paper Dynamic epigenomic landscapes during early lineage specification in mouse embryos.

Look for a section called Accession Numbers (Usually in the “Data Availability” section after the discussion). Use the information you find there to locate the data in the sequence read archive.

We’ll work with two datasets, STEM-seq E4.0ICM rep1 and STEM-seq E5.5Epi rep1. However, because of the sequencing depth required for analysis of bisulfite seqsuencing data, you will be working with a subset of reads specific to a region on chromosome 6. These data should already be on your laptops in /Users/cmdb/data/rawdata with the following files:

  • mm10_refseq_genes_chr6_50M_60M.bed
  • methylation_fastq.tar.gz
  • chr6.fa.gz

If you need a new copy of any of these, you download them from here:

These reads have also been trimmed to remove a poly-G adapter. You will need to determine which file belongs to which sample using GEO (look for the SRA link on each sample’s webpage). I would recommend creating an index folder and move chr6.fa.gz into this folder. This will help keep things organized when you create your bismark index files. Finally, you need to unpack the fastq files:

tar xzf methylation_fastq.tar.gz

You should now have four fastq datasets (two cell types, paired end reads). Inspect the files with less to be sure. Run fastqc on one of the fastq files (one end of one dataset). I would recommend creating a folder to hold the output and specify this with -o. This will help keep things organized. You do not need any other optional arguments. Inspect the resulting report. Does anything stand out? Can you explain why you’re seeing anything unusual?

Bisulfite mapping with Bismark

Bismark is a mapping tool for bisulfite treated data.

You will first need to index your reference genome. For speed, we will just use chromosome 6 from mouse genome assembly 10 (mm10). You can use bismark_genome_preparation to create an index for mapping to this genome. This will take a few minutes.

  • Hint: You can use the --help flag to get details about each of the options and requirements for this program.
  • Note: Bismark defaults to bowtie2 so you shouldn’t need to pass the argument to use bowtie2.
  • Note: You can speed things up by using multiple cores. I recommend 7 (--parallel 7).

Next, map your two experiments separately to the genome using bismark. The result will be a .bam file for each condition containing the mapped reads and annotations of the methylation state. I recommend using -B for naming your output files something useful for identifying the two conditions. Next, use deduplicate_bismark to remove duplicate reads. Finally, use IGV to visualize these bam files. You will first need to create sorted BAM files (samtools sort) and index them (samtools index). You can switch IGV to use mm10 by choosing load genome from server.

To extract methylation data, we will use bismark_methylation_extractor. Run this program on each of your two deduplicated BAM files, using the --bedgraph and --comprehensive options. You can load the resulting bedgraph files into IGV alongside your reads. Save an image of the region chr6:51,750,000-52,750,000 including the methylation bedgraph tracks for both conditions and the refseq gene track.

Analysis

The file mm10_refseq_genes_chr6_50M_60M.bed contains all of the refseq genes found in the region chr6:50,000,000-60,000,000. Currently the file is in the format described here. However, we want to specifically look at the promoter regions (which we’re defining as the 10Kb upstream of the transcription start site as CpG islands can occur several Kb away). You can extract the promoters and write them to a bed file and remove duplicates (the original file contains an entry for each transcript) and Riken theoritical genes (identifiers ending with “Rik”) with the following command:

awk 'BEGIN{OFS="\t"}{if ($4 == "+") print $3,$5 - 2000,$5,$13,$12,$4; else print $3,$6,$6 + 2000,$13,$12,$4;}' mm10_refseq_genes_chr6_50M_60M.bed | grep -v Rik | uniq -f 3 | sort -k2,2n > promoters.bed

I highly recommend looking at some examples of awk as it is an incredibly versatile and handy command line tool to know how to use.

For each promoter, find the total methylation signal in E4 to E5.5 cells (sum all methylation site scores). You need to produce 3 plots:

  1. A histogram of cumulative methylation scores occuring in each promoter for E4.0 cells
  2. A histogram of cumulative methylation scores occuring in each promoter for E5.5 cells
  3. A scatterplot of E4.0 by E5.5 promoter methlylation scores. Color the hox genes a different color than the remaining genes.
  • Hint: Take a look at bedtools for help with summing the scores. There may be a relevant subfunction (wink wink).
  • Hint: Make sure to look at the files produced. What happens when there are no methylation calls in the promoter?

You should make sure your plots have informative axis labels.

You should also use a t-test to see if the mean methylation of hox gene promoters is lower than other gene promoters in E4.0 and E5.5 cells (two tests, one for each cell type). Python has a function for performing such a test in the scipy package.

Submit

  • A record of your command line calls
  • An image of your methylation data for the specified region from IGV
  • Your script(s) for plotting the 3 requested plots and t-tests

Advanced excercise

  • Write a Python script to extract the promoters from the refseq gene file (in cases with multiple transcripts, you can just take the coordinates from the first one)
  • Write a Python script to sum the methylation signal in each promoter