Assignment Date: Friday, November 19, 2021
Due Date: Friday, December 3, 2021 @ 1pm ET
We will use scanpy for this lab, a fairly comprehensive Python package for scRNA-seq analysis. We’ll create a conda environment for this lab called scanpy. Use the following command to create it:
conda create -n scanpy python=3.6 scanpy matplotlib python-igraph jupyter leidenalg -c bioconda -c conda-forge
Then you can activate the conda environment with the following command:
conda activate scanpy
We will be looking at a dataset containing ~10,000 brain cells from an E18 mouse. The data were produced with the 10x technology, using their most recent (v3) chemistry.
What’s already been done: the CellRanger package from 10x was used to align and count the reads. Reads were de-duplicated using UMIs and separated into “cells” based on barcodes, and then aligned to a transcriptome (using the STAR aligner). The result is a cell x gene matrix, which has been stored in a binary format (an hdf5 file).
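To make the idea of UMI de-duplication and a cell x gene matrix concrete, here is a minimal pure-Python sketch. It is only a toy illustration and not how CellRanger is implemented: the function name count_matrix and the reads are made up, and real pipelines also handle sequencing errors in barcodes and UMIs, which this ignores.

```python
from collections import defaultdict


def count_matrix(reads):
    """Collapse (barcode, umi, gene) reads into a nested cell -> gene -> count table.

    Reads that share barcode, UMI, and gene are treated as copies of one
    original molecule and are counted once.
    """
    molecules = set(reads)  # identical (barcode, umi, gene) triples collapse here
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, gene in molecules:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}


# Hypothetical reads: (cell barcode, UMI, gene the read aligned to).
reads = [
    ("AAAC", "TTG", "Snap25"),
    ("AAAC", "TTG", "Snap25"),  # same barcode, UMI, and gene: a duplicate
    ("AAAC", "GCA", "Snap25"),  # different UMI: a second molecule
    ("AAAC", "CCT", "Gfap"),
    ("GGTA", "TTG", "Gfap"),    # different barcode: a different cell
]
table = count_matrix(reads)
print(table)
```

Each outer key plays the role of one row (cell) of the matrix and each inner key one column (gene).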
Once you’re in the directory where you want the data, you can use the following wget command to download it:
wget --no-check-certificate https://bx.bio.jhu.edu/data/cmdb-lab/scrnaseq/neuron_10k_v3_filtered_feature_bc_matrix.h5
To get you started: all access to Scanpy is typically through a module called scanpy, which is imported under the name sc for convenience. We then load the count matrix into a table, which is an instance of AnnData.
import scanpy as sc
# Read 10x dataset
adata = sc.read_10x_h5("neuron_10k_v3_filtered_feature_bc_matrix.h5")
# Make variable names (in this case the genes) unique
adata.var_names_make_unique()
For the rest of this lab, refer to the Scanpy documentation, and specifically the API documentation.
Filtering tools are largely under the sc.pp module. We suggest using the Zheng et al. 2017 filtering approach. Produce a PCA plot before and after filtering (see the sc.tl module to actually perform the PCA and sc.pl for plotting).
Use Leiden clustering to identify clusters in the data. Produce t-SNE and UMAP plots showing the clusters.
Identify and plot genes that distinguish each cluster. Use both the t-test and logistic regression approaches, implemented through the rank_genes_groups function.
Now the fun part.
Using your knowledge, identify some marker genes that should distinguish different brain cell types. You must identify at least 8 cell types.
Useful support plots include dotplots and clustermaps, which let you see how a specific set of genes is associated with your clusters, as well as stacked violin plots. Besides these support plots, make an overall t-SNE or UMAP plot that labels your clusters with the cell types you think they mostly represent. Make sure to provide the support plots you made in order to establish your labeling.