Assignment 10: Single-Cell RNA-seq

Assignment Date: Friday, November 19, 2021
Due Date: Friday, December 3, 2021 @ 1pm ET

Lecture

Conda environment

We will use scanpy for this lab, a fairly comprehensive Python package for scRNA-seq analysis. First, create a conda environment for the lab called scanpy:

conda create -n scanpy python=3.6 scanpy matplotlib python-igraph jupyter leidenalg -c bioconda -c conda-forge

Then activate the environment:

conda activate scanpy

Getting data

We will be looking at a dataset containing ~10,000 brain cells from an E18 mouse. This was produced using the 10x technology, using their most recent (v3) chemistry.

What’s already been done: the Cell Ranger package from 10x was used to align and count the reads. Reads were aligned to a transcriptome (using the STAR aligner), separated into “cells” based on barcodes, and de-duplicated using UMIs. The result is a cell x gene count matrix, which has been stored in a binary format (an HDF5 file).

Once you’re in the directory where you want the data, download it with the following wget command:

wget --no-check-certificate https://bx.bio.jhu.edu/data/cmdb-lab/scrnaseq/neuron_10k_v3_filtered_feature_bc_matrix.h5

The Assignment

Getting data into Scanpy

To get you started: Scanpy is typically accessed through a module called scanpy, which is imported under the name sc for convenience. We then load the count matrix into a table, which is an instance of AnnData.

import scanpy as sc
# Read 10x dataset
adata = sc.read_10x_h5("neuron_10k_v3_filtered_feature_bc_matrix.h5")
# Make variable names (in this case the genes) unique
adata.var_names_make_unique()

For the rest of this lab, refer to the Scanpy documentation, and specifically the API documentation.

Step 1: Filtering

Filtering tools are largely under the sc.pp module. We suggest using the Zheng et al. 2017 filtering approach (sc.pp.recipe_zheng17). Produce a PCA plot before and after filtering (see the sc.tl module to actually perform the PCA and sc.pl for plotting).

Step 2: Clustering

Use Leiden clustering to identify clusters in the data. Produce t-SNE and UMAP plots showing the clusters.

Step 3: Distinguishing genes

Identify and plot genes that distinguish each cluster. Use both the t-test and logistic regression approaches, implemented through the rank_genes_groups function.

Step 4: Cell types?

Now the fun part.

Using your knowledge, identify some marker genes that should distinguish different brain cell types. You must identify at least 8 cell types.

You can color UMAP and t-SNE plots by any gene of your choice, which is helpful for visualizing which clusters are enriched for which genes, and which clusters might correspond to a specific brain cell type.
Alternatively, you can produce dotplots, clustermaps, or stacked violin plots, which show how a specific set of genes is associated with your clusters.

Besides these support plots, make an overall t-SNE or UMAP plot that labels your clusters with the cell types you think they mostly represent. Make sure to provide the support plots you made in order to establish your labeling.

Submit

Python code
PCA plots before and after filtering from step 1
t-SNE and UMAP plots of clusters from step 2
Plots for genes that distinguish clusters (t-test and logistic regression) from step 3
The overall t-SNE or UMAP plot with at least 8 cell types labeled
Support plots that provide evidence for your cell type assignments