Assignment 9: Gene Expression

Assignment Date: Friday, November 12, 2021
Due Date: Friday, November 19, 2021 @ 1pm ET

Slides

Getting your data

We are going to be working with the Drosophila embryo gene expression data that you processed for bootcamp. Please obtain the FPKM values for each of the 8 samples from this link: https://www.dropbox.com/s/hxjvua05abz8zqb/all_annotated.csv?dl=1 (the trailing ?dl=1 is part of the URL). You can either use pd.read_csv() to read the CSV directly from the URL, or use wget to download the data to your computer.
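As a starting point, here is a minimal sketch of the loading step. It assumes the file has one row per gene, with an identifier in the first column and one FPKM column per sample. The helper name `load_fpkm` and the column labels in the toy data are hypothetical, so check the real header before relying on them.

```python
import io
import pandas as pd

def load_fpkm(source, index_col=0):
    """Read an FPKM table with genes as rows and samples as columns.

    `source` may be a URL, a local path, or an open file-like object.
    Assumption: the first column holds the gene identifier.
    """
    return pd.read_csv(source, index_col=index_col)

# Toy stand-in for the real file; these column names are made up.
toy_csv = io.StringIO(
    "gene,male_10,female_10\n"
    "geneA,1.5,2.0\n"
    "geneB,0.0,0.0\n"
)
fpkm = load_fpkm(toy_csv)
print(fpkm.shape)
```

For the real data, pass the Dropbox URL above (or the path of the file you downloaded with wget) instead of `toy_csv`.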

The Assignment

Clustering

SciPy has a lot of tools for different kinds of clustering. Here we will be using the tools for hierarchical clustering in scipy.cluster.hierarchy. In particular, we will focus on using linkage and leaves_list.

The documentation for SciPy isn’t very helpful, but with some quick Googling you can find examples of how to use both of these tools.

Limit the dataset to genes with a median expression across samples of greater than zero.
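One way to express this filter, sketched on a toy table (the gene and sample names are hypothetical):

```python
import pandas as pd

# Toy FPKM table: genes as rows, samples as columns.
fpkm = pd.DataFrame(
    {"s1": [0.0, 5.0, 0.0], "s2": [0.0, 7.0, 2.0], "s3": [3.0, 9.0, 4.0]},
    index=["geneA", "geneB", "geneC"],
)

# axis=1 takes the median across samples, giving one value per gene.
medians = fpkm.median(axis=1)

# Keep only genes whose median expression is strictly greater than zero.
expressed = fpkm.loc[medians > 0]
print(list(expressed.index))
```

Here geneA is dropped because two of its three values are zero, so its median is zero.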

Apply a log2(FPKM + 0.1) transformation to the data.
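A short sketch of the transformation on a toy table (the names are hypothetical). The 0.1 pseudocount keeps zero FPKM values finite after taking the log.

```python
import numpy as np
import pandas as pd

# Toy FPKM table: genes as rows, samples as columns.
fpkm = pd.DataFrame(
    {"s1": [0.0, 0.9], "s2": [3.9, 7.9]},
    index=["geneA", "geneB"],
)

# NumPy ufuncs apply elementwise and preserve the DataFrame labels.
log_fpkm = np.log2(fpkm + 0.1)
print(log_fpkm)
```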

Cluster the data matrix for both genes and samples on their patterns of expression (so both the rows and columns of the matrix), and plot a heatmap of the gene expression data.
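The sketch below shows one possible shape for this step on random toy data. The choice of average linkage (with SciPy's default Euclidean distance) and the use of imshow for the heatmap are assumptions, not requirements of the assignment.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
# Toy log-expression matrix: 20 genes (rows) x 6 samples (columns).
data = rng.normal(size=(20, 6))

# linkage clusters the rows of its input, so cluster genes on `data`
# and samples on the transpose.
gene_order = leaves_list(linkage(data, method="average"))
sample_order = leaves_list(linkage(data.T, method="average"))

# Reorder rows, then columns, by the dendrogram leaf order.
clustered = data[gene_order, :][:, sample_order]

fig, ax = plt.subplots(figsize=(4, 6))
im = ax.imshow(clustered, aspect="auto", interpolation="nearest")
ax.set_xlabel("samples (clustered)")
ax.set_ylabel("genes (clustered)")
fig.colorbar(im, ax=ax, label="log2(FPKM + 0.1)")
```

With a pandas DataFrame, the same reordering can be done with `df.iloc[gene_order, sample_order]`, which also keeps the row and column labels for the axis ticks.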

Next create a dendrogram relating the samples to one another.
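A minimal sketch of a sample dendrogram using scipy.cluster.hierarchy.dendrogram, on random toy data. The sample names and the average-linkage method are assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
# Toy log-expression matrix: 20 genes (rows) x 4 samples (columns).
data = rng.normal(size=(20, 4))
sample_names = ["s1", "s2", "s3", "s4"]  # hypothetical labels

# Transpose so that each sample is one observation (row) for linkage.
Z = linkage(data.T, method="average")

fig, ax = plt.subplots()
tree = dendrogram(Z, labels=sample_names, ax=ax)
ax.set_ylabel("distance")
```

The returned dictionary holds the plotted leaf labels in left-to-right order under the key "ivl".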

Differential expression

Work with the same low-expression-filtered and log2-transformed dataset that you prepared for the clustering analysis above.

Use ordinary least squares regression to test for genes that are differentially expressed across stages. Use the stage number as a numeric independent variable (10, 11, 12, 13, 14), and ignore the letter suffixes on stage 14 (i.e., treat 14A, 14B, 14C, and 14D as equivalent).
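The sketch below fits a single toy gene with scipy.stats.linregress, which performs simple OLS with one predictor. The sample labels of the form `<sex>_<stage>` and the helper `stage_number` are hypothetical; adapt the parsing to the real column names.

```python
import re
import numpy as np
from scipy.stats import linregress

# Hypothetical sample labels; check the real header.
samples = ["male_10", "male_11", "male_12", "male_13",
           "male_14A", "male_14B", "male_14C", "male_14D"]

def stage_number(label):
    """Return the numeric stage, ignoring any letter suffix (14A -> 14)."""
    match = re.search(r"(\d+)[A-Za-z]?$", label)
    return int(match.group(1))

stages = np.array([stage_number(s) for s in samples], dtype=float)

# Toy log2 expression for one gene: rises with stage, plus a little noise.
rng = np.random.default_rng(2)
expr = 0.5 * stages + rng.normal(scale=0.1, size=stages.size)

# Regress expression on stage; slope is the beta, pvalue tests slope == 0.
fit = linregress(stages, expr)
print(fit.slope, fit.pvalue)
```

For the full dataset, repeat the fit for each row of the transformed table and collect the slope and p-value per gene.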

Generate a QQ plot from the p-values.
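One common form of QQ plot for p-values compares the sorted observed values with the quantiles expected under a uniform distribution, both on a -log10 scale. The sketch below uses simulated null p-values, and the (i - 0.5) / n plotting positions are one convention among several.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
pvals = rng.uniform(size=1000)  # toy p-values drawn under the null

n = pvals.size
# Smallest p-value first, so both arrays run from most to least extreme.
observed = -np.log10(np.sort(pvals))
expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)

fig, ax = plt.subplots()
ax.scatter(expected, observed, s=5)
lim = max(expected.max(), observed.max())
ax.plot([0, lim], [0, lim], color="gray", linestyle="--")  # y = x reference
ax.set_xlabel("expected -log10(p)")
ax.set_ylabel("observed -log10(p)")
```

Points that rise above the dashed line at the right-hand end indicate more small p-values than expected by chance.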

Report the list of genes that exhibit differential expression by stage at a 10% false discovery rate.
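The assignment does not name a specific FDR procedure. The sketch below assumes the Benjamini-Hochberg method, implemented by hand on toy p-values; the function name and gene names are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

genes = np.array(["geneA", "geneB", "geneC", "geneD"])
pvals = np.array([0.01, 0.04, 0.03, 0.5])

qvals = benjamini_hochberg(pvals)
significant = genes[qvals <= 0.10]  # 10% false discovery rate
print(significant)
```

If statsmodels is available, statsmodels.stats.multitest.multipletests with method="fdr_bh" computes the same adjustment.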

Repeat the analysis while controlling for sex.
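Controlling for sex means adding it as a second column in the regression design. The sketch below does this with plain NumPy and SciPy for one toy gene. The helper `ols_pvalues`, the 0/1 coding of sex, and the toy layout of stages and sexes are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit y = X @ beta by least squares; return (beta, two-sided p-values)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof           # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the coefficients
    t_stat = beta / np.sqrt(np.diag(cov))
    pvals = 2 * stats.t.sf(np.abs(t_stat), dof)
    return beta, pvals

# Toy design (hypothetical layout): each stage measured in both sexes.
stage = np.array([10, 11, 12, 13, 14, 14, 14, 14] * 2, dtype=float)
sex = np.array([0] * 8 + [1] * 8, dtype=float)  # arbitrary 0/1 coding

# Columns: intercept, stage, sex.
X = np.column_stack([np.ones_like(stage), stage, sex])

rng = np.random.default_rng(4)
y = 1.0 + 0.5 * stage + 2.0 * sex + rng.normal(scale=0.1, size=stage.size)

beta, pvals = ols_pvalues(X, y)
print("stage beta:", beta[1], "stage p-value:", pvals[1])
```

The stage coefficient and its p-value (index 1) are the quantities to carry forward; the sex term only absorbs variation that would otherwise be attributed to stage or noise. The formula interface in statsmodels (for example a model such as "expr ~ stage + sex") is a common alternative.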

Report the list of genes that exhibit differential expression by stage at a 10% false discovery rate while controlling for sex.

Compare the two lists: what is the percentage overlap between the results with and without sex as a covariate?
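A small sketch of the overlap calculation using Python sets, following the formula given in the Submit section (overlap divided by the size of the list without the covariate). The gene names are made up.

```python
# Toy gene lists; replace with the two lists of significant genes.
without_sex = {"geneA", "geneB", "geneC", "geneD"}
with_sex = {"geneB", "geneC", "geneE"}

# Genes called significant in both analyses.
overlap = without_sex & with_sex

# Percentage relative to the list obtained without the covariate.
percent_overlap = len(overlap) / len(without_sex) * 100
print(f"{percent_overlap:.1f}%")
```

If the list without the covariate could be empty, guard against division by zero before computing the percentage.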

Generate a volcano plot of the differential expression results from the analysis with sex as a covariate (beta on the x-axis, -log10(p-value) on the y-axis). Plot the significant points in a different color.
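A minimal volcano plot sketch on simulated values. The `significant` mask here is a stand-in threshold on raw p-values; in the assignment it would come from the 10% FDR results.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
betas = rng.normal(size=200)    # toy stage coefficients
pvals = rng.uniform(size=200)   # toy p-values
significant = pvals < 0.05      # stand-in for the FDR-based mask

neg_log_p = -np.log10(pvals)

fig, ax = plt.subplots()
# Draw the two groups separately so each gets its own color and legend entry.
ax.scatter(betas[~significant], neg_log_p[~significant],
           s=8, color="gray", label="not significant")
ax.scatter(betas[significant], neg_log_p[significant],
           s=8, color="red", label="significant")
ax.set_xlabel("beta (stage coefficient)")
ax.set_ylabel("-log10(p-value)")
ax.legend()
```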

Submit

Plot: Clustered heatmap of gene expression.
Plot: Dendrogram of samples
Plot: QQ plot of differential expression results
Text: List of differentially expressed genes, with and without sex as a covariate.
Text: Percentage overlap: ((# overlapping genes) / (# genes in list without covariate)) * 100
Plot: Volcano plot
Code: All code from analysis.

Note: Feel free to submit the plots and code as a single Jupyter notebook.

Resources

Here are some awesome resources for you. We don’t expect you to read these all, but they’re relevant for discussions we had in today’s lecture and could be helpful for you in your future research.

RNA-sequencing overview. Figure 2 is especially useful and presents some of the quantification steps/tools that we haven’t shown you.
Specific example pipeline from Steven Salzberg & Co, commonly used in previous iterations of bootcamp
Batch effects discussion from Stephanie Hicks, specifically concerning single cell RNAseq
Replicates vs Depth and the relation to statistical power
The Omnigenic inheritance model related to the discussion of cis- and trans- effects