Assignment Date: Friday, November 12, 2021
Due Date: Friday, November 19, 2021 @ 1pm ET
We are going to be working with the Drosophila embryo gene expression data that you processed for bootcamp. Please obtain the FPKM values for each of the 8 samples here: https://www.dropbox.com/s/hxjvua05abz8zqb/all_annotated.csv?dl=1
You can either use pd.read_csv() to read in the CSV from the URL directly, or use wget to download the data to your computer.
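As a rough sketch of the loading step: the helper below (load_fpkm is a hypothetical name) reads the CSV and keeps only the numeric columns. The column layout in the small stand-in table (an identifier column, a gene_name column, then one column per sample) is an assumption for illustration, so check the real header before relying on it.

```python
import io

import pandas as pd

URL = "https://www.dropbox.com/s/hxjvua05abz8zqb/all_annotated.csv?dl=1"


def load_fpkm(source, index_col=0):
    """Read the CSV and keep only the numeric (FPKM) columns.

    `source` can be a URL, a local path, or a file-like object.
    """
    table = pd.read_csv(source, index_col=index_col)
    return table.select_dtypes(include="number")


# Stand-in data with an ASSUMED layout; the real file may differ.
demo_csv = io.StringIO(
    "t_name,gene_name,male_10,female_10\n"
    "FBtr1,geneA,1.5,2.0\n"
    "FBtr2,geneB,0.0,3.0\n"
)
demo = load_fpkm(demo_csv)
print(demo)

# For the real data (requires network access):
# fpkm = load_fpkm(URL)
```

If you download the file with wget instead, pass the local filename to the same function.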
SciPy has a lot of tools for different kinds of clustering. Here we will be using the tools for hierarchical clustering. In particular, we will focus on linkage and leaves_list, both found in scipy.cluster.hierarchy.
The documentation for SciPy isn’t very helpful, but with some quick Googling you can find examples of how to use both of these tools.
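To make the two functions concrete, here is a minimal sketch on a made-up 4 x 2 matrix. The choice of average linkage with Euclidean distance is an assumption for illustration, not a requirement of the assignment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy matrix: rows 0 and 2 are similar, rows 1 and 3 are similar.
toy = np.array([
    [0.0, 0.1],
    [5.0, 5.1],
    [0.2, 0.0],
    [5.2, 5.0],
])

# linkage builds the hierarchical clustering of the ROWS.
# Method and metric here are illustrative choices.
Z = linkage(toy, method="average", metric="euclidean")

# leaves_list returns the row indices in dendrogram (left-to-right) order.
order = leaves_list(Z)

# Reordering the rows places similar rows next to each other.
reordered = toy[order]
print(order)
print(reordered)
```

To cluster columns instead of rows, pass the transposed matrix to linkage.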
Limit the dataset to genes with a median expression across samples of greater than zero.
Apply a log2(FPKM + 0.1) transformation to the data.
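The filtering and transformation steps above could look something like the sketch below. The function name filter_and_log and the toy gene and sample labels are hypothetical; the table is assumed to have genes as rows and samples as columns.

```python
import numpy as np
import pandas as pd


def filter_and_log(fpkm, pseudocount=0.1):
    """Keep genes whose median FPKM across samples is > 0, then log2-transform."""
    keep = fpkm.median(axis=1) > 0
    return np.log2(fpkm.loc[keep] + pseudocount)


# Toy table (made-up values): genes are rows, samples are columns.
toy_fpkm = pd.DataFrame(
    {"s1": [0.0, 1.0, 0.0], "s2": [0.0, 2.0, 4.0], "s3": [5.0, 3.0, 4.0]},
    index=["g1", "g2", "g3"],
)
log_fpkm = filter_and_log(toy_fpkm)
print(log_fpkm)  # g1 is dropped because its median across samples is 0
```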
Cluster the data matrix for both genes and samples on their patterns of expression (so both the rows and columns of the matrix), and plot a heatmap of the gene expression data.
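One possible way to approach the heatmap is sketched below: cluster the rows, cluster the transposed matrix to get the column order, reorder, and draw with matplotlib. The function name clustered_heatmap, the random stand-in data, the sample labels, and the average-linkage choice are all assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list


def clustered_heatmap(values, sample_labels):
    """Reorder rows and columns by hierarchical clustering and draw a heatmap."""
    values = np.asarray(values, dtype=float)
    row_order = leaves_list(linkage(values, method="average"))
    col_order = leaves_list(linkage(values.T, method="average"))
    ordered = values[row_order][:, col_order]

    fig, ax = plt.subplots(figsize=(6, 8))
    image = ax.imshow(ordered, aspect="auto", interpolation="nearest")
    ax.set_xticks(range(len(col_order)))
    ax.set_xticklabels([sample_labels[i] for i in col_order], rotation=90)
    ax.set_yticks([])  # too many genes to label individually
    ax.set_ylabel("Genes")
    fig.colorbar(image, ax=ax, label="log2(FPKM + 0.1)")
    return fig, row_order, col_order


# Random stand-in for the filtered, log-transformed matrix.
rng = np.random.default_rng(0)
demo_labels = ["sample_a", "sample_b", "sample_c", "sample_d"]  # hypothetical
demo_values = rng.normal(size=(20, len(demo_labels)))
fig, row_order, col_order = clustered_heatmap(demo_values, demo_labels)
```

Call fig.savefig with a filename of your choice to write the figure to disk.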
Next, create a dendrogram relating the samples to one another.
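For the dendrogram, SciPy's dendrogram function can draw the tree directly from a linkage result. This is a minimal sketch on random stand-in data; the sample names and the linkage method are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Random stand-in for the log-transformed matrix (genes x samples).
rng = np.random.default_rng(1)
sample_names = ["male_10", "male_11", "female_10", "female_11"]  # hypothetical
demo = rng.normal(size=(30, len(sample_names)))

# Transpose so that SAMPLES are the observations being clustered.
Z_samples = linkage(demo.T, method="average")

fig, ax = plt.subplots()
tree = dendrogram(Z_samples, labels=sample_names, ax=ax, leaf_rotation=90)
ax.set_ylabel("Distance")
fig.tight_layout()
```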
Work with the same low-expression-filtered and log2-transformed dataset that you prepared for the clustering analysis above.
Use ordinary least squares regression to test for genes that are differentially expressed across stages. Use the stage number as a numeric independent variable (10, 11, 12, 13, 14), and ignore the letter suffixes on stage 14 (i.e., treat 14A, 14B, 14C, and 14D as equivalent).
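A minimal sketch of the per-gene regression follows. It uses scipy.stats.linregress, which fits a simple (one-predictor) ordinary least squares line; other OLS implementations would work equally well. The sample-name format (sex, underscore, stage, e.g. male_14A) and the toy values are assumptions.

```python
import re

import numpy as np
import pandas as pd
from scipy.stats import linregress


def stage_number(sample_name):
    """Extract the numeric stage, dropping any letter suffix (14A -> 14)."""
    match = re.search(r"(\d+)[A-D]?$", sample_name)
    return int(match.group(1))


def regress_on_stage(log_fpkm):
    """Fit expression ~ stage for every gene; return slope (beta) and p-value."""
    stages = np.array([stage_number(c) for c in log_fpkm.columns], dtype=float)
    rows = []
    for gene, values in log_fpkm.iterrows():
        fit = linregress(stages, values.to_numpy(dtype=float))
        rows.append((gene, fit.slope, fit.pvalue))
    return pd.DataFrame(rows, columns=["gene", "beta", "pvalue"]).set_index("gene")


# Toy data with ASSUMED sample names; values are made up.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 5.2], [3.0, 2.9, 3.1, 3.0, 2.95, 3.05]],
    index=["up", "flat"],
    columns=["male_10", "male_11", "male_12", "male_13", "male_14A", "male_14B"],
)
results = regress_on_stage(toy)
print(results)
```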
Generate a QQ plot from the p-values.
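One common way to draw a QQ plot of p-values is to compare the sorted observed values against evenly spaced quantiles of the uniform distribution, both on a -log10 scale. The sketch below assumes that convention; the function name qq_points and the simulated p-values are for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt


def qq_points(pvalues):
    """Return (-log10 expected, -log10 observed) pairs for a p-value QQ plot."""
    observed = np.sort(np.asarray(pvalues, dtype=float))
    n = observed.size
    # Expected quantiles of a Uniform(0, 1) sample of size n.
    expected = (np.arange(1, n + 1) - 0.5) / n
    return -np.log10(expected), -np.log10(observed)


# Simulated p-values under the null (uniform), as a stand-in.
rng = np.random.default_rng(2)
x, y = qq_points(rng.uniform(size=500))

fig, ax = plt.subplots()
ax.scatter(x, y, s=8)
limit = max(x.max(), y.max())
ax.plot([0, limit], [0, limit], color="grey", linestyle="--")  # null expectation
ax.set_xlabel("Expected -log10(p)")
ax.set_ylabel("Observed -log10(p)")
```

Points that rise above the dashed line indicate more small p-values than expected under the null.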
Report the list of genes that exhibit differential expression by stage at a 10% false discovery rate.
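The assignment does not name a specific FDR procedure. A common choice is Benjamini-Hochberg, sketched here in plain NumPy as an assumption; library implementations exist and may be preferable in practice.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    qvalues = np.empty(n)
    qvalues[order] = np.clip(ranked, 0.0, 1.0)
    return qvalues


# Made-up p-values for four hypothetical genes.
toy_p = np.array([0.01, 0.04, 0.03, 0.5])
q = benjamini_hochberg(toy_p)
significant = q < 0.10  # 10% false discovery rate
print(q)
print(significant)
```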
Repeat the analysis while controlling for sex.
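Controlling for sex means adding it as a second predictor alongside stage. The sketch below does the multiple regression by hand with NumPy so each step is visible; a statistics library's OLS routine would give the same coefficients. The sample-name format, the 0/1 coding of sex, and the toy values are assumptions.

```python
import re

import numpy as np
from scipy import stats


def design_matrix(sample_names):
    """Columns: intercept, numeric stage, sex indicator (1 = male, 0 = female)."""
    stage = [int(re.search(r"(\d+)[A-D]?$", n).group(1)) for n in sample_names]
    is_male = [1.0 if n.startswith("male") else 0.0 for n in sample_names]
    return np.column_stack([np.ones(len(sample_names)), stage, is_male])


def ols_fit(y, X):
    """Ordinary least squares: coefficients and two-sided t-test p-values."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    pvalues = 2 * stats.t.sf(np.abs(beta / se), dof)
    return beta, pvalues


# ASSUMED sample names; expression values are simulated for one gene.
names = ["male_10", "male_11", "male_12", "male_13",
         "female_10", "female_11", "female_12", "female_13"]
X = design_matrix(names)
noise = np.array([0.01, -0.02, 0.015, -0.005, 0.02, -0.01, 0.005, -0.015])
y = 0.5 * X[:, 1] + 2.0 * X[:, 2] + noise

beta, pvalues = ols_fit(y, X)
print(beta[1], pvalues[1])  # stage effect, adjusted for sex
```

The stage coefficient and its p-value (index 1) are the quantities to carry forward to the FDR step.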
Report the list of genes that exhibit differential expression by stage at a 10% false discovery rate while controlling for sex.
Compare the lists: what is the percentage overlap with and without sex as a covariate?
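"Percentage overlap" can be defined in more than one way. The sketch below assumes shared genes divided by the union of both lists; if you use a different denominator (for example, the size of one list), state it in your answer. The gene names are made up.

```python
def percent_overlap(genes_a, genes_b):
    """Shared genes as a percentage of the union of both lists (an assumed definition)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical gene lists.
without_sex = ["geneA", "geneB", "geneC"]
with_sex = ["geneB", "geneC", "geneD"]
overlap = percent_overlap(without_sex, with_sex)
print(overlap)
```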
Generate a volcano plot of the differential expression results from the analysis with sex as a covariate (-log10(p-value) on the y-axis, beta on the x-axis). Color the significant points in a different color.
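A volcano plot along these lines could be sketched as follows. The function name volcano_plot, the simulated betas and p-values, and the stand-in significance mask are all assumptions; in your analysis the mask would come from the 10% FDR step.

```python
import numpy as np
import matplotlib.pyplot as plt


def volcano_plot(beta, pvalue, significant):
    """Scatter beta (x) against -log10(p) (y), highlighting significant genes."""
    beta = np.asarray(beta, dtype=float)
    neg_log_p = -np.log10(np.asarray(pvalue, dtype=float))
    significant = np.asarray(significant, dtype=bool)

    fig, ax = plt.subplots()
    ax.scatter(beta[~significant], neg_log_p[~significant],
               color="grey", s=8, label="Not significant")
    ax.scatter(beta[significant], neg_log_p[significant],
               color="red", s=8, label="FDR < 10%")
    ax.set_xlabel("beta")
    ax.set_ylabel("-log10(p-value)")
    ax.legend()
    return fig, ax


# Simulated stand-in results; the mask below is NOT a real FDR call.
rng = np.random.default_rng(3)
demo_beta = rng.normal(size=200)
demo_p = rng.uniform(size=200)
demo_mask = demo_p < 0.01
fig, ax = volcano_plot(demo_beta, demo_p, demo_mask)
```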
Note: Feel free to submit the plots and code as a single Jupyter notebook.
Here are some awesome resources for you. We don’t expect you to read these all, but they’re relevant for discussions we had in today’s lecture and could be helpful for you in your future research.