scRNA-seq Tutorial

Section 01 — Concepts

What is scRNA-seq?

Single-cell RNA sequencing measures gene expression in thousands of individual cells simultaneously. Unlike bulk RNA-seq which gives an average across all cells, scRNA-seq resolves cellular heterogeneity — you identify distinct cell types, rare populations, and transcriptional states within the same sample.

Bulk RNA-seq is like blending a fruit salad and measuring the average flavor. scRNA-seq is like tasting each piece individually — you find out there are strawberries, grapes, and kiwi that the blended average would never reveal.

Feature	Bulk RNA-seq	scRNA-seq
Resolution	Tissue average	Single cell
Cell types	Mixed together	Resolved individually
Rare populations	Masked	Detectable
Sparsity	Low	~90% zeros
Spatial info	❌	❌ (dissociated)
Cost per sample	Low	Higher

Figure 1: Complete scRNA-seq workflow from tissue to biological insights.

Section 02 — Concepts

Droplets, Barcodes & UMIs

Understanding the technology explains why your data looks the way it does — sparse, with cell barcodes and UMI counts.

The droplet system

Cells flow through a microfluidic chip and are captured one-per-droplet with a barcoded bead. Inside each droplet, RNA is reverse-transcribed and tagged with two identifiers before sequencing.

Figure 2: Each droplet contains one cell + one barcoded bead. mRNA is tagged with the cell barcode and a UMI before sequencing.

What is a cell barcode?

A short DNA sequence (16–20bp) unique to each bead. After sequencing, every read carries this barcode — so you know which cell it came from. 10x Chromium uses a whitelist of ~6,700 known valid barcodes for error correction.

What is a UMI?

A random 10–12bp sequence that tags each individual mRNA molecule before PCR amplification. This solves a critical problem:

Figure 3: UMIs prevent PCR duplication inflation. Each original molecule gets a unique tag — duplicates share the same UMI and are collapsed to a count of 1.

ℹ️

snRNA-seq vs scRNA-seq

Some datasets use single-nucleus RNA-seq (snRNA-seq) — nuclei isolated instead of whole cells. Better for hard-to-dissociate tissues like brain. Cytoplasmic RNAs are underrepresented but the analysis pipeline is nearly identical.

Section 03 — Dataset

The PBMC 3k Dataset

3k PBMCs from a Healthy Donor

10x Genomics · Chromium v1 chemistry · Illumina NextSeq 500 · hg19 reference genome

Peripheral Blood Mononuclear Cells from a healthy donor. The canonical teaching dataset — used in the official Scanpy and Seurat tutorials. Small enough to run on any laptop in minutes.

2,700Cells

32,738Genes

~2,638Cells after QC

~2,000HVGs used

→ Download from 10x Genomics

🔬

Understanding the numbers

2,700 cells = rows. Each cell was captured in its own droplet and barcoded uniquely.

32,738 genes = columns. Every gene in the hg19 genome — but ~90% of values are zeros. This sparsity is biology, not a problem: most genes are not expressed in most cells.

2,638 after QC = 62 cells removed (high mito %, too few genes = dying or empty).

~2,000 HVGs = only the most variable genes are used for clustering. The rest add noise.

What cell types to expect in PBMCs

CD4 T cells

IL7R, CCR7, CD3D

CD8 T cells

CD8A, GZMB, NKG7

NK cells

NKG7, GNLY, KLRD1

B cells

MS4A1, CD79A, CD19

CD14 Monocytes

CD14, LYZ, CST3

FCGR3A Monocytes

FCGR3A, MS4A7

Dendritic cells

FCER1A, CST3

Platelets

PPBP, PF4

Section 04 — Pipeline

Upstream: FASTQ → Count Matrix

Cell Ranger already ran for the PBMC 3k dataset. But understanding what happened upstream explains why your matrix looks the way it does.

UPSTREAM — core facility / Cell Ranger

🧬

wet lab

10x Chromium droplet capture

Cells + barcoded beads encapsulated one-per-droplet. mRNA reverse-transcribed and tagged with cell barcode + UMI.

🔬

sequencer

Illumina sequencing → FASTQ

Read 1 (28bp) = cell barcode + UMI. Read 2 (91bp) = cDNA insert. Paired-end output: R1.fastq.gz + R2.fastq.gz.

⚙️

Cell Ranger / STARsolo

Alignment + barcode correction + UMI counting

STAR aligns reads to hg19. Cell Ranger matches barcodes to whitelist, deduplicates UMIs, identifies real cells vs empty droplets via knee plot. Output: matrix.mtx + barcodes.tsv + genes.tsv.

DOWNSTREAM — your laptop starts here ↓

ℹ️

Cell Ranger is your nf-core equivalent

Just like nf-core/rnaseq takes FASTQs and outputs a counts matrix for DESeq2, Cell Ranger takes FASTQs and outputs a sparse matrix for Scanpy. You run it once on the cluster, then do all analysis locally. The open-source alternative is STARsolo — same output format, no license needed.

STARsolo on Xanadu (SBATCH)

BASH starsolo.sh

#!/bin/bash
#SBATCH --job-name=STARsolo
#SBATCH --ntasks=8
#SBATCH --mem=64G

module load STAR/2.7.10a

STAR --runThreadN 8 \
  --genomeDir genome_index/ \
  --readFilesIn sample_R2.fastq.gz sample_R1.fastq.gz \
  --readFilesCommand zcat \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist 10x_barcodes_v3.txt \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 12 \
  --soloStrand Forward \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix output/sample_

Section 05 — Pipeline

The Count Matrix & AnnData

The count matrix is the output of Cell Ranger and the input to Scanpy. Understanding its structure is essential.

Figure 4: Three Cell Ranger output files load into an AnnData object. The structure is identical to the spatial proteomics tutorial — just no spatial coordinates.

ℹ️

Why ~90% zeros?

Sparsity arises from two sources. Biological: most genes are cell-type specific — insulin is only expressed in β-cells, HBB only in red blood cells. Technical: low-abundance transcripts may not get captured during library prep (dropout). Both contribute. Sparse matrix formats (.mtx) store only non-zero values, keeping file sizes manageable.

Section 06 — Setup

Environment Setup

BASHsetup.sh

conda create -n scrna python=3.10 -y
conda activate scrna

pip install scanpy anndata
pip install pandas numpy matplotlib seaborn
pip install leidenalg igraph
pip install jupyter

jupyter notebook

Section 07 — Pipeline

Load the Data

PYTHON01_load.py

import scanpy as sc
import numpy as np

sc.settings.verbosity = 3
sc.set_figure_params(dpi=80, facecolor="white")

# ── Option A: auto-download (easiest) ────────────────────────────
adata = sc.datasets.pbmc3k()

# ── Option B: load from Cell Ranger output ────────────────────────
adata = sc.read_10x_mtx(
    "data/filtered_gene_bc_matrices/hg19/",
    var_names="gene_symbols",
    cache=True
)
adata.var_names_make_unique()

print(adata)
# AnnData object with n_obs × n_vars = 2700 × 32738

Section 08 — Pipeline

The Mathematical Goal

Before diving into code, it helps to understand the mathematical goal of the entire pipeline.

We start with a sparse, high-dimensional count matrix — 2,700 cells × 32,738 genes. Each cell is a point in 32,738-dimensional space. Our goal is to transform this into a low-dimensional representation where distance reflects biological similarity. Cells of the same type should cluster together; different types should be far apart. Everything — normalization, HVG selection, PCA, UMAP, Leiden — serves this single objective.

🧠

The transformation pipeline in one sentence

Raw counts → correct for technical noise (QC, normalization) → reduce dimensions (HVG, PCA) → embed in 2D (UMAP) → find communities (Leiden) → assign biology (annotation). Each step removes noise and preserves signal.

⚠️

UMAP distances are not absolute

UMAP is a visualization tool, not a quantitative embedding. The distance between two clusters on a UMAP does not reflect how biologically different they are. Two clusters far apart on UMAP may be more similar than two clusters that appear close. Use UMAP to see structure — use quantitative methods (DEG, pseudotime) to measure it.

Section 09 — Pipeline

QC & Filtering

QC removes low-quality cells. Three metrics identify problems. This step is unique to scRNA-seq — IMC doesn't need it because tissue is fixed, not dissociated.

⚠️

Thresholds are dataset-dependent — always look at distributions first

The values below (200/2500/5%) are reasonable starting points for PBMC data. Tumor cells legitimately have more genes. Brain cells have higher baseline mito%. Always inspect violin plots before filtering — cutting too aggressively removes real rare cell populations.

n_genes_by_counts

Filter: <200 or >2500

Too few = empty droplet or dead cell. Too many = doublet (two cells captured together).

total_counts

Filter: outliers

Total UMIs per cell. Correlates with n_genes. Extremes = doublets or poor quality.

pct_counts_mt

Filter: >5%

% mitochondrial genes. High = dying cell. Cytoplasmic RNA leaks out; mt-RNA stays. No equivalent in IMC.

Figure 5: Three cell fates in QC. Healthy cells pass all thresholds. Dying cells show high mito% (cytoplasm leaks out, mitochondria stay). Doublets show abnormally high gene counts.

PYTHON02_qc.py

# ── Flag mitochondrial genes ──────────────────────────────────────
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# ── Calculate QC metrics ──────────────────────────────────────────
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.violin(adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4, multi_panel=True, save="_qc.pdf"
)

# ── Filter ────────────────────────────────────────────────────────
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=2500)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.pct_counts_mt < 5]

print(f"After QC: {adata.n_obs} cells")   # ~2638

ℹ️

Doublet detection — go beyond n_genes threshold

The n_genes >2500 filter catches obvious doublets but misses subtle ones. For rigorous analysis, use a computational doublet detector: pip install scrublet — it simulates artificial doublets and scores each cell for doublet probability. Recommended for any dataset going to publication.

ℹ️

Cell cycle effects

Actively cycling cells express cell cycle genes (S and G2M phase markers) that can dominate clustering — you end up with clusters defined by cell cycle state rather than cell type. Score cell cycle phases with sc.tl.score_genes_cell_cycle() and optionally regress them out before PCA.

Section 10 — Pipeline

Normalization

Cells are sequenced to different depths. Without correction, deeply-sequenced cells look like they express more of everything — a technical artifact, not biology. Normalization adjusts for sequencing depth, though it does not fully correct compositional bias — if one gene is very highly expressed it can still influence the scaling of others. For most analyses this is acceptable; for sensitive comparisons consider SCTransform.

Figure 6: Library size correction then log1p transform makes cells with different sequencing depths directly comparable.

PYTHON03_normalize.py

# ── Store raw counts before transformation ────────────────────────
adata.layers["counts"] = adata.X.copy()
adata.raw = adata          # freeze raw — needed for DEG later

# ── Normalize to 10,000 counts per cell ───────────────────────────
sc.pp.normalize_total(adata, target_sum=1e4)

# ── log1p transform ───────────────────────────────────────────────
# log1p(x) = log(x+1) handles zeros, compresses dynamic range
sc.pp.log1p(adata)

Section 11 — Pipeline

Highly Variable Genes

With 32,738 genes, most add noise. HVGs are genes with higher variance than expected given their mean expression — this mean-variance correction is important because high-count genes naturally show more variance. Simply taking the most variable genes without this correction would always select highly expressed housekeeping genes. Unlike scRNA-seq, IMC measures a predefined protein panel, so feature selection occurs during experimental design rather than analysis.

PYTHON04_hvg.py

# ── Find top 2000 most variable genes ────────────────────────────
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    min_mean=0.0125, max_mean=3, min_disp=0.5,
    flavor="seurat_v3",
    layer="counts"
)
sc.pl.highly_variable_genes(adata, save="_hvg.pdf")

print(f"HVGs: {adata.var.highly_variable.sum()} / {adata.n_vars}")

# ── Scale to unit variance before PCA ────────────────────────────
# Centers each gene mean=0, variance=1
# Clip outliers at 10 std devs
sc.pp.scale(adata, max_value=10)

Section 12 — Pipeline

PCA, UMAP & Clustering

Even 2,000 HVGs is too high-dimensional to cluster directly. PCA reduces to ~50 components, then UMAP projects to 2D for visualization. Leiden finds communities in the neighbor graph.

Figure 7: Dimensionality reduction pipeline. 2000 genes → 50 PCs (linear) → neighbor graph → UMAP (2D) + Leiden clusters.

PYTHON05_clustering.py

# ── PCA — only uses HVGs automatically ───────────────────────────
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)   # check elbow

# ── Neighbor graph ────────────────────────────────────────────────
# n_pcs=40 standard for PBMC; adjust based on elbow plot
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# ── UMAP ──────────────────────────────────────────────────────────
sc.tl.umap(adata, random_state=42)

# ── Leiden clustering ─────────────────────────────────────────────
# flavor="igraph" avoids leidenalg dependency
sc.tl.leiden(adata, resolution=0.5, flavor="igraph",
             n_iterations=2, random_state=0)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.umap(adata,
    color=["leiden", "n_genes_by_counts", "pct_counts_mt"],
    ncols=3, save="_clusters.pdf"
)

⚠️

UMAP is stochastic — always set a seed

Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.

⚠️

Leiden resolution is a hyperparameter — more clusters ≠ better biology

Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Beginners often over-cluster by increasing resolution until every cluster looks distinct. Start at 0.3–0.5, validate clusters biologically before increasing. A cluster with no clear marker genes is probably artificial.

⚠️

Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.

⚠️

Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Start at 0.3–0.5, validate clusters biologically before increasing.

Section 13 — Pipeline

Cell Type Annotation

Leiden clusters are just numbers. You assign biology by finding marker genes per cluster and matching to known cell type signatures.

PYTHON06_annotation.py

# ── Find markers: each cluster vs all others ─────────────────────
sc.tl.rank_genes_groups(
    adata, groupby="leiden", method="wilcoxon", use_raw=True
)
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)

# ── Dotplot of known markers ──────────────────────────────────────
marker_genes = {
    "CD4 T":       ["IL7R", "CCR7"],
    "CD8 T":       ["CD8A", "NKG7"],
    "NK":          ["GNLY", "NKG7"],
    "B cells":     ["MS4A1", "CD79A"],
    "CD14 Mono":   ["CD14", "LYZ"],
    "FCGR3A Mono": ["FCGR3A", "MS4A7"],
    "DC":          ["FCER1A", "CST3"],
    "Platelet":    ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden")

# ── Assign labels based on dotplot ───────────────────────────────
cell_type_map = {
    "0": "CD4 T", "1": "CD14 Mono", "2": "CD4 T",
    "3": "B cells", "4": "CD8 T", "5": "NK",
    "6": "FCGR3A Mono", "7": "DC", "8": "Platelet"
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cell_type_map)
sc.pl.umap(adata, color="cell_type", legend_loc="on data")

ℹ️

vs IMC annotation

IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.

ℹ️

IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.

Section 14 — Pipeline

Differential Expression

PYTHON07_deg.py

# ── DEG: CD8 T vs CD4 T ──────────────────────────────────────────
sc.tl.rank_genes_groups(
    adata, groupby="cell_type",
    groups=["CD8 T"], reference="CD4 T",
    method="wilcoxon", use_raw=True
)

# ── Extract as dataframe ──────────────────────────────────────────
deg_df = sc.get.rank_genes_groups_df(
    adata, group="CD8 T", pval_cutoff=0.05, log2fc_min=0.5
)
print(deg_df.head(20))

# ── Heatmap of top markers ────────────────────────────────────────
sc.pl.rank_genes_groups_heatmap(
    adata, n_genes=10, groupby="cell_type",
    show_gene_labels=True, save="_deg_heatmap.pdf"
)

Section 15 — Advanced

Batch Effects & Correction

PBMC 3k is a single sample so batch effects don't apply here. But in any real multi-sample dataset they are unavoidable — and failing to correct them is one of the most common mistakes in published scRNA-seq analyses.

Batch effects are technical differences between samples processed on different days, by different people, or on different instruments. Cells from batch A may cluster separately from batch B not because they're biologically different, but because of library prep variation. Without correction, you annotate technical clusters as cell types.

How to detect batch effects

Color your UMAP by sample/batch after integration. If cells from different samples don't mix within the same cell type clusters — you have batch effects that need correction.

Correction tools

PYTHONbatch_correction.py

# ── Option 1: Harmony (fast, widely used) ─────────────────────────
import harmonypy as hm
import scanpy as sc

sc.tl.pca(adata)
ho = hm.run_harmony(adata.obsm["X_pca"], adata.obs, "batch")
adata.obsm["X_pca_harmony"] = ho.Z_corr.T

# Use harmony embedding for neighbors/UMAP instead of raw PCA
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)

# ── Option 2: scVI (deep learning, more powerful) ─────────────────
import scvi
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()

# ── Option 3: BBKNN (fast graph-based) ───────────────────────────
import bbknn
bbknn.bbknn(adata, batch_key="batch")  # replaces sc.pp.neighbors()

💡

Which method to use?

Harmony — start here. Fast, easy, works well for most cases. scVI — better for large datasets or strong batch effects, but slower. BBKNN — good for datasets where you want to preserve more biological variation. All three are published and widely used.

💡

Section 16 — Advanced

Common Pitfalls

These mistakes appear in published papers. Knowing them makes you a better analyst and reviewer.

Over-filtering in QC

Risk: lose rare cell types

Setting mito% cutoff too low or n_genes too strict removes rare populations. A cell type with 500 cells in 50k is 1% — easy to lose. Always check what you filtered out.

Too many PCs

Risk: noise dominates

Using n_pcs=50 when the elbow is at 15 means you're feeding noise into the neighbor graph. Check the variance ratio plot and pick PCs at the elbow.

Over-clustering

Risk: artificial cell types

Increasing resolution until clusters look distinct. Every artificial cluster will have some differentially expressed genes by chance. Validate with known biology before reporting.

Ignoring batch effects

Risk: technical clusters

In multi-sample data, uncorrected batch effects create clusters of "Sample A cells" and "Sample B cells" that get annotated as cell types. Always color UMAP by sample.

Interpreting UMAP distances

Risk: wrong conclusions

Proximity on UMAP does not equal biological similarity. Clusters that look close may be more different than clusters far apart. UMAP is for visualization only — not measurement.

No reproducibility seeds

Risk: unreproducible figures

UMAP and Leiden are stochastic. Without random_state, your figures change every run. Set seeds before any stochastic step — essential for shared or published analysis.

Section 17 — Output

Publication Figures

🔵

Annotated UMAP

All cells colored by cell type

🎻

QC violins

n_genes, UMIs, mito% per sample

🟡

Dotplot

Marker gene expression per cluster

🔥

DEG heatmap

Top markers across all clusters

📊

Composition bar

% each cell type per sample

🌋

Volcano plot

log2FC vs -log10(p)

Figure 8: Complete scRNA-seq analysis workflow. Core analysis (this tutorial) plus advanced downstream options.

🎯

Key differences from spatial proteomics (IMC)

scRNA-seq adds three steps IMC doesn't need: (1) mito QC — cells can die during dissociation; (2) HVG selection — 32k genes need filtering, but IMC feature selection happens at experimental design time when the antibody panel is chosen; (3) richer DEG — transcriptome-wide coverage. IMC adds one thing scRNA-seq can't do: spatial neighborhood analysis — where cells actually sit in tissue.