We will use scanpy
for this lab, a fairly comprehensive Python package for scRNA-seq analysis. Create a mamba environment that you will use for this lab called scanpy
using the following command:
mamba create -n scanpy scanpy openblas "matplotlib<3.7" leidenalg "pandas<2" -y
Then, activate the mamba environment with the command mamba activate scanpy
.
For the rest of this lab, refer to the Scanpy documentation, and specifically the API documentation.
This documentation is broken into several sections, grouping functions by stage of analysis. This grouping is reflected in the prefix of each function:
pp.
tl.
pl.
scanpy
packageYou will be looking at a dataset comprising scRNA-seq data of ~3,000 peripheral blood mononuclear cells (PBMCs) from a healthy human donor. This data was produced by 10X Genomics using their dedicated tools.
What’s already been done: the CellRanger package from 10x was used to align and count the reads. Reads were de-duplicated using UMIs (unique molecular identifiers) and separated into “cells” based on barcodes, and then aligned to a transcriptome (using the STAR aligner). The result is a cell x gene matrix of expression, which has been stored in a sparse matrix format.
Once you’re in the directory where you want the data, you can use the following command to download it:
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
tar xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
This will create a folder, filtered_gene_bc_matrices/hg19/
with 3 files:
barcodes.tsv
genes.tsv
matrix.mtx
You will also need the live-coding excercise code. You can download the complete code here.
The live-coding script loads the pre-aligned and deduplicated reads into scanpy. It then preforms the following steps:
filtered_data.h5
variable_data.h5
Using the three files above, run the live-coding script to produce the 2 input files you will need for the homework assignment: filtered_data.h5
and variable_data.h5
.
You will first need to load the counts matrix from the live-coding exercise into a special kind of table, which is an instance of Scanpy’s AnnData
class.
You will be modifying this table throughout the homework as you run different analyses, and using it for plotting, all using scanpy functions.
Load the counts matrix into a scanpy AnnData
object using the code below:
import scanpy as sc
# Read the 10x dataset filtered down to just the highly-variable genes
adata = sc.read_h5ad("variable_data.h5")
adata.uns['log1p']['base'] = None # This is needed due to a bug in scanpy
NOTE: Most scanpy
functions will modify your adata
object in place. This means you generally don’t need to create new variables when you run scanpy
functions. You can simply run the function on the adata
object, and it will modify it/add data to it as necessary.
A good place to start with scRNA-seq data is to try to cluster cells based on their expression profiles. In this exercise you will be doing just that: clustering cells based on their expression profiles, and then visualizing those clusters, using tSNE and UMAP.
Before running clustering, scanpy needs to construct what is called a “neighborhood” graph, which essentially records each cell’s “nearest neighbors”: the cells whose expression looks the most similar to the focal cell.
Using the scanpy documentation, find a pre-processing function that computes a neighborhood graph from adata
. Run the function you found using the parameters n_neighbors=10
and n_pcs=40
.
Next, you will use leiden clustering (this is just a specific clustering algorithm, there are many others) to identify clusters in the data. This clustering algorithm will use the neighborhood graph you computed in Step 1.1 to actually produce the clusters.
As before, use the scanpy documentation to find a tool function that will run leiden clustering on your data. Run the function you found; you shouldn’t need any additional parameters.
In this step, you will be visualizing the clusters you identified in the previous steps using two separate approaches: UMAP and t-SNE. UMAP and t-SNE are both dimensionality reduction tools for visualizing structure in your data, just like PCA, but they make fewer assumptions about how the data actually is structured.
Before you can produce UMAP and tSNE plots, you’ll actually need to run the UMAP and tSNE algorithms, which essentially “embed” your high-dimensional data into two dimensions.
Use the scanpy documentation to find a tool function to run the UMAP algorithm on your data. Note that to create the UMAP transformation, you need to specify maxiter
. We suggest 900.
Now, find a tool function to run the tSNE algorithm on your data. You should not need any additional parameters.
Now that you’ve run the UMAP and t-SNE algorithms, you’ll want to plot the output. Find a scanpy plotting function to plot your UMAP results. You’ll want to specify the color
argument to color the cells by their leiden cluster (color='leiden'
), the title
argument to set the panel title, and the save
argument to specify a file suffix for saving the plot. Note that plots are saved into a folder figures
that scanpy creates. Now find a scanpy plotting function to plot your t-SNE results. This function takes the same arguments as the t-SNE plotting function.
Title your figures appropriately. You will be uploading them for the assignment.
Now that you’ve identified clusters in your data, ideally you’d like to be able to determine what cell types those clusters correspond to. The first step to doing that is to identify the “marker genes” that seem to define each cluster: which genes are specifically upregulated in each cluster, relative to the other clusters. In this exercise, you’ll be identifying each cluster’s marker genes.
Scanpy has a great tool function for identifying and ranking potential marker genes in each cluster: rank_genes_groups()
. This function has multiple different methods for ranking genes; you’ll be using and comparing two: 1) Wilcoxon rank-sum and 2) logistic regression.
First: using the sc.tl.rank_genes_groups()
function, rank marker genes in your data using the Wilcoxon rank-sum method. You’ll want to specify the method
argument to use the Wilcoxon rank-sum method, as well as the groupby
argument to use the leiden clusters you identified earlier (groupby='leiden'
). You’ll also want to set use_raw=True
and copy=True
. Store the results of running this function in a new wilcoxon_adata
variable. This will essentially be a copy of the adata
object, but with the Wilcoxon rank_genes_groups
data added.
Second: using the sc.tl.rank_genes_groups()
function, rank marker genes in your data using the logistic regression method. You’ll want to specify the method
argument to use the logistic regression method, as well as the groupby
argument to use the leiden clusters you identified earlier (groupby='leiden'
). You’ll also want to set use_raw=True
and copy=True
. Store the results of running this function in a new logreg_adata
variable. This will essentially be a copy of the adata
object, but with the Logistic regression rank_genes_groups
data added.
Helpfully, scanpy has another plotting function for visualizing the gene rankings from sc.tl.rank_genes_groups()
.
First, using the sc.pl.rank_genes_groups()
function with the wilcoxon_adata
object you created in the previous step, plot the top 25 genes for each cluster when using the Wilcoxon rank-sum method. As before, use save
to save your plot. You’ll want to specify the n_genes
and title
arguments. You’ll also want to set sharey=False
, show=False
, and use_raw=True
. Label your figure appropriately and save it to a file. You will be uploading it with the assignment.
Second, using the sc.pl.rank_genes_groups()
function with the logreg
object you created in the previous step, plot the top 25 genes for each cluster when using the logistic regression method. Use the same arguments as you did for the wilcoxon data. Label your figure appropriately and save it to a file. You will be uploading it with the assignment.
Now that you’ve identified marker genes for each cluster in your data set, you want to assign cell-types to each cluster, based on those marker genes.
In the next couple of steps, you may receive errors trying to plot the data for certain genes that you pull from the highly-ranked genes. That’s because the ranking is done across all genes. Because some genes are filtered out at the “highly-variable gene” filtering step, despite showing cell-type specificity, you should load the dataset filtered_data.h5
which has the genes that were removed by the highly-variable gene filtering. In order to keep the leiden clusters and UMAP or t-SNE mappings, you will need to copy them from your clustered and transformed data. To do this, you can copy the values to a variable and then copy that variable to the filtered_data.h5
dataset.
leiden = adata.obs['leiden']
umap = adata.obsm['X_umap']
tsne = adata.obsm['X_tsne']
adata = sc.read_h5ad('filtered_data.h5')
adata.obs['leiden'] = leiden
adata.obsm['X_umap'] = umap
adata.obsm['X_tsne'] = tsne
You can do this as a single script or by creating 2 scripts and saving the data at the end of the first script and loading it in the second, to avoid having to rerun the clustering steps, like this:
adata.write('filtered_clustered_data.h5')
and load in a new script:
adata = sc.read_h5ad("filtered_clustered_data.h5")
adata.uns['log1p']['base'] = None # This is needed due to a bug in scanpy
To plot expression data for multiple genes, each in their own plot, pass a list of gene names to the UMAP or t-SNE plotting function using the color
argument. You can also add a subplot with cluster labels by including the name leiden
in the list passed to color
.
Now the fun part.
Identify some marker genes in your data set that should distinguish different blood cell types. Your goal is to try to identify at least 3 distinct cell types in your data based on the marker genes from exercise 2. You can do this using your knowledge, the knowledge of hematopoeisis aficionados in your cohort, or the power of the internet. There are many resources online for identifying relationships between marker genes and cell types. Take a look at this database for an example of ways to identify cell types. You may need multiple marker genes to distinguish between clusters.
Once you have a short list of genes that should distinguish cell-types in your data, produce ONE figure that supports your hypothesis that these genes should be marking specific clusters in your dataset. You have several options (and this link may help you):
A set of UMAP or t-SNE plots, colored by expression of a specific gene. You can color the plots by a gene by using the color
argument. Create a multi-panel figure (one panel per marker gene) showing the expression of these genes in either a UMAP or t-SNE plot. Compare your results to your plot from step 1.3 to validate your hypothesis (no need to write anything down).
dotplot
showing expression of your chosen marker genes in each clusterclustermap
showing expression of your chosen marker genes in each clusterChoose one approach and produce the corresponding figure. Make sure that the marker genes you chose do indeed match your expectations. Label your figure appropriately and save it to a file. You will be uploading it with the assignment.
Now that you’ve confirmed that the marker genes you’ve chosen do indeed “mark” the three clusters you expected them to, you want to update your cluster labels with the cell-types corresponding to those clusters.
Using adata.rename_categories()
rename the clusters in your adata
object to the cell types you think they are (based on the marker genes). This function requires two arguments, the annotation to relabel (leiden
), and a list of new unique labels of the same length as the number of clusters.
Finally, make an overall t-SNE or UMAP plot that labels your clusters with the cell types you think they mostly represent, either on the plot or in a legend. Label your figure appropriately and save it to a file. You will be uploading it with the assignment.
Total Points: 10
Identify all clusters (there should be 8), in addition to the three you chose in step 3.2. Create a dotplot with a gene as specific to each cluster as you can find (8 clusters, 8 genes). Create a multi-panel UMAP or t-SNE plot colored by the expression of each gene as well as the leiden clusters.