In today’s assignment you will practice the principles of data visualization that were covered in the lecture. Visualizations should accurately and concisely convey a message in an easily understandable way without misleading the audience. Avoid extraneous visual elements.
You will start with some more directed visualizations of the GTEx RNA-seq dataset you analyzed last week, followed by a more open-ended project.
Re-load the data you downloaded in the previous lab session. These data comprise RNA-seq samples from whole blood from the GTEx Consortium (755 total individuals). As a reminder, they were downloaded directly from the GTEx portal and slightly reformatted to save you some time on tedious data wrangling. If needed, download again from the Dropbox links below:
Subject-level metadata
Gene expression matrix
In the first exercise, you’ll be exploring some aspects of the GTEx whole blood data and generating plots that communicate the observed patterns in an easy-to-interpret manner. You will be producing four figures. Save each figure as its own file and upload it as part of the assignment.
For all figures, label the axes appropriately, provide legends only when necessary, and do not place a title or any other elements on the plot itself. You will be graded on proper labelling.
Create a plotting_exercise1.py script now for this exercise.
In your plotting_exercise1.py script, load and normalize the GTEx data using the code below:
import numpy as np
import pandas as pd
from pydeseq2 import preprocessing
from matplotlib import pyplot as plt
# read in data
counts_df = pd.read_csv("gtex_whole_blood_counts_formatted.txt", index_col = 0)
# read in metadata
metadata = pd.read_csv("gtex_metadata.txt", index_col = 0)
# normalize
counts_df_normed = preprocessing.deseq2_norm(counts_df)[0]
# log
counts_df_logged = np.log2(counts_df_normed + 1)
# merge with metadata
full_design_df = pd.concat([counts_df_logged, metadata], axis=1)
For subject GTEX-113JC, plot the distribution of expression (logged normalized counts) across all genes, excluding any genes with 0 counts. Upload this figure for the assignment.
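One possible approach is sketched below on made-up data, not the real GTEx files. It assumes the logged matrix has one row per subject and one column per gene, as the normalization code above implies; the toy subject and gene names, the bin count, and the output file name are all hypothetical. Because log2(0 + 1) is 0, keeping values greater than zero in the logged data drops the zero-count genes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for counts_df_logged (subjects x genes); values are made up.
rng = np.random.default_rng(0)
toy_logged = pd.DataFrame(
    np.log2(rng.poisson(5, size=(3, 200)) + 1),
    index=["GTEX-113JC", "SUBJ-B", "SUBJ-C"],
    columns=[f"gene{i}" for i in range(200)],
)

# Select one subject's row, then keep only genes with non-zero expression.
subject_expr = toy_logged.loc["GTEX-113JC"]
nonzero_expr = subject_expr[subject_expr > 0]

fig, ax = plt.subplots()
ax.hist(nonzero_expr, bins=20)
ax.set_xlabel("Expression, log2(normalized counts + 1)")
ax.set_ylabel("Number of genes")
fig.savefig("subject_expression_hist.png")  # hypothetical file name
plt.close(fig)
```

With the real data, the same selection would presumably be applied to `counts_df_logged` instead of the toy frame.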
For the gene MXD4, plot the distribution of expression (logged normalized counts) in males versus females. Upload this figure for the assignment.
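A minimal sketch of one way to compare two groups, using side-by-side box plots on synthetic values. The column name `SEX` and its `"male"`/`"female"` labels are assumptions for illustration; the real metadata may name or code this variable differently. Overlaid histograms or violin plots would be equally reasonable choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for full_design_df; column names and values are assumptions.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "MXD4": rng.normal(10, 1, size=100),
    "SEX": rng.choice(["male", "female"], size=100),
})

# Split the expression values by sex (groupby sorts the group keys).
groups = {sex: grp["MXD4"].to_numpy() for sex, grp in toy.groupby("SEX")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
ax.set_xticks(range(1, len(groups) + 1))  # boxplot positions start at 1
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("Sex")
ax.set_ylabel("MXD4 expression, log2(normalized counts + 1)")
fig.savefig("mxd4_by_sex.png")  # hypothetical file name
plt.close(fig)
```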
Plot the number of subjects in each age category. Upload this figure for the assignment.
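Counting subjects per category is a natural fit for `value_counts` followed by a bar plot. The sketch below uses a tiny made-up metadata table; the column name `AGE` and the decade-style labels are assumptions about how the real file is organized.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd
from matplotlib import pyplot as plt

# Toy metadata; the "AGE" column name and its labels are assumed, not confirmed.
toy_metadata = pd.DataFrame(
    {"AGE": ["20-29", "30-39", "30-39", "50-59", "20-29", "30-39"]}
)

# Count subjects per category and sort by label so the x-axis runs young to old.
age_counts = toy_metadata["AGE"].value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(age_counts.index, age_counts.values)
ax.set_xlabel("Age category (years)")
ax.set_ylabel("Number of subjects")
fig.savefig("subjects_per_age.png")  # hypothetical file name
plt.close(fig)
```

Sorting by index matters here because `value_counts` orders by frequency by default, which would scramble the age axis.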
For the gene LPXN, plot the median expression (logged normalized counts) over time (i.e. in each age category), stratified by sex. Upload this figure for the assignment.
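One way to compute a stratified summary is to group by both variables, take the median, and pivot one grouping level into columns so that each sex becomes its own line. The toy values and the column names `AGE` and `SEX` below are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in for full_design_df; column names and values are made up.
toy = pd.DataFrame({
    "AGE": ["20-29"] * 4 + ["30-39"] * 4,
    "SEX": ["male", "male", "female", "female"] * 2,
    "LPXN": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 10.0],
})

# Median per (age, sex) pair; unstack turns the sex level into columns.
medians = toy.groupby(["AGE", "SEX"])["LPXN"].median().unstack("SEX")

fig, ax = plt.subplots()
for sex in medians.columns:
    ax.plot(medians.index, medians[sex], marker="o", label=sex)
ax.set_xlabel("Age category (years)")
ax.set_ylabel("Median LPXN expression, log2(normalized counts + 1)")
ax.legend(title="Sex")  # a legend is needed here to tell the two lines apart
fig.savefig("lpxn_median_by_age_sex.png")  # hypothetical file name
plt.close(fig)
```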
We highly encourage you to do this exercise with a partner.
With your partner, select any data set from the TidyTuesday repository on GitHub: https://github.com/rfordatascience/tidytuesday. In this exercise, you’ll be exploring interesting patterns in the data set that you choose.
Create a plotting_exercise2.py script now for this exercise.
Using Python (pandas, numpy, matplotlib, etc.), explore these data with your partner, searching for any interesting features or patterns. Jot down any interesting patterns you observe as notes (no need to submit). For each feature/pattern you observe, think about what kind of plot would best communicate that feature/pattern.
Choose three aspects/patterns of these data that are best represented by three different types of plots (e.g., histogram, bar plot, line plot, heatmap, etc.).
Generate these figures using matplotlib. As always, label the axes appropriately and avoid extraneous visual elements. For each plot, add a title that concisely states the message that your figure is attempting to convey.
For this assignment you should submit the following:
plotting_exercise1.py script to load and analyze data and produce plots from Exercise 1 (3 points total)
plotting_exercise2.py script to load and analyze data and produce plots from Exercise 2 (3.5 points total)
Total Points: 10
DO NOT push any raw data! Only the things we asked for!
For each age range, plot the proportion of samples within each group of the Hardy scale. This should be a single panel figure.
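A stacked bar chart is one single-panel option: each bar is an age range and its segments are the within-range proportions. The sketch below assumes the Hardy scale is stored in a column named `DTHHRDY` (the GTEx phenotype variable name) and age in `AGE`; both names, and the toy values, should be checked against the real metadata.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Toy metadata; column names "AGE" and "DTHHRDY" are assumptions.
toy_metadata = pd.DataFrame({
    "AGE": ["20-29", "20-29", "20-29", "20-29", "30-39", "30-39"],
    "DTHHRDY": [0, 0, 1, 2, 2, 2],
})

# normalize="index" divides each row by its total, so every age range sums to 1.
props = pd.crosstab(toy_metadata["AGE"], toy_metadata["DTHHRDY"], normalize="index")

fig, ax = plt.subplots()
bottom = np.zeros(len(props))
for category in props.columns:
    heights = props[category].to_numpy()
    ax.bar(props.index, heights, bottom=bottom, label=str(category))
    bottom = bottom + heights  # stack the next category on top
ax.set_xlabel("Age category (years)")
ax.set_ylabel("Proportion of samples")
ax.legend(title="Hardy scale")
fig.savefig("hardy_by_age.png")  # hypothetical file name
plt.close(fig)
```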
We’d like to see if we can identify any broad patterns present in our gene expression data. To explore this, we’re going to cluster the data, both by sample as well as by gene.
To perform clustering, you’ll be using the dendrogram, linkage, and leaves_list functions from scipy. The documentation for SciPy isn’t very helpful, but with some quick Googling you can find examples of how to use these functions.
Using linkage and leaves_list, cluster the filtered and log2 transformed gene expression data matrix for both genes and samples based on their patterns of expression (so both the rows and columns of the matrix). You will find the numpy transpose functionality useful in order to cluster the columns.
Plot a heatmap of the clustered gene expression data.
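The clustering and heatmap steps can be sketched as follows, on a small random matrix rather than the real expression data. The orientation (rows as samples, columns as genes), the `"average"` linkage method, and the output file name are assumptions; `linkage` clusters the rows of whatever it is given, so the transpose is used to cluster the columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import leaves_list, linkage

# Random stand-in for the filtered, log2-transformed matrix (6 samples x 8 genes).
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(6, 8))

# linkage builds the hierarchical tree; leaves_list returns the leaf order.
row_order = leaves_list(linkage(toy_expr, method="average"))    # cluster rows
col_order = leaves_list(linkage(toy_expr.T, method="average"))  # cluster columns

# Reorder rows, then columns, so similar profiles sit next to each other.
clustered = toy_expr[row_order, :][:, col_order]

fig, ax = plt.subplots()
image = ax.imshow(clustered, aspect="auto", interpolation="nearest")
ax.set_xlabel("Genes (clustered order)")
ax.set_ylabel("Samples (clustered order)")
fig.colorbar(image, ax=ax, label="Expression (toy values)")
fig.savefig("clustered_heatmap.png")  # hypothetical file name
plt.close(fig)
```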
Using the dendrogram function, create a dendrogram relating the samples to one another.
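A minimal dendrogram sketch, again on random data with hypothetical sample names. It assumes samples are the rows of the matrix; if they are columns in your data, transpose before calling `linkage`. The linkage method is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Random stand-in for the expression matrix (5 samples x 10 genes).
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(5, 10))
sample_names = [f"sample{i}" for i in range(5)]  # hypothetical labels

# One linkage row per merge, so n samples give n - 1 rows.
tree = linkage(toy_expr, method="average")

fig, ax = plt.subplots()
result = dendrogram(tree, labels=sample_names, ax=ax)
ax.set_xlabel("Sample")
ax.set_ylabel("Cluster distance")
fig.savefig("sample_dendrogram.png")  # hypothetical file name
plt.close(fig)
```

The returned dictionary holds the plotted leaf labels under the key `"ivl"`, which is handy for checking that every sample appears in the tree.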