Section 01 — Concepts

What is scRNA-seq?

Single-cell RNA sequencing measures gene expression in thousands of individual cells simultaneously. Unlike bulk RNA-seq which gives an average across all cells, scRNA-seq resolves cellular heterogeneity — you identify distinct cell types, rare populations, and transcriptional states within the same sample.

Bulk RNA-seq is like blending a fruit salad and measuring the average flavor. scRNA-seq is like tasting each piece individually — you find out there are strawberries, grapes, and kiwi that the blended average would never reveal.

FeatureBulk RNA-seqscRNA-seq
ResolutionTissue averageSingle cell
Cell typesMixed togetherResolved individually
Rare populationsMaskedDetectable
SparsityLow~90% zeros
Spatial info❌ (dissociated)
Cost per sampleLowHigher
Tissue / Blood Sample PBMCs from healthy donor Cell Dissociation Single cells in suspension 10x Droplet Capture + Barcoding Cell barcode + UMI tagging per cell Illumina Sequencing → FASTQ R1: barcode+UMI · R2: cDNA Cell Ranger / STARsolo Align → deduplicate UMIs → count matrix YOUR LAPTOP ↓ Scanpy Analysis QC Normalize HVG PCA+UMAP Cluster Annotate DEG Biology
Figure 1: Complete scRNA-seq workflow from tissue to biological insights.


Section 02 — Concepts

Droplets, Barcodes & UMIs

Understanding the technology explains why your data looks the way it does — sparse, with cell barcodes and UMI counts.

The droplet system

Cells flow through a microfluidic chip and are captured one-per-droplet with a barcoded bead. Inside each droplet, RNA is reverse-transcribed and tagged with two identifiers before sequencing.

ONE OIL DROPLET Single cell mRNA inside mRNA Barcoded bead CB · UMI releases mRNA on lysis tags each mRNA molecule CB = cell barcode (which cell) · UMI = unique molecular identifier (which molecule)
Figure 2: Each droplet contains one cell + one barcoded bead. mRNA is tagged with the cell barcode and a UMI before sequencing.

What is a cell barcode?

A short DNA sequence (16–20bp) unique to each bead. After sequencing, every read carries this barcode — so you know which cell it came from. 10x Chromium uses a whitelist of ~6,700 known valid barcodes for error correction.

What is a UMI?

A random 10–12bp sequence that tags each individual mRNA molecule before PCR amplification. This solves a critical problem:

WITHOUT UMI 1 original molecule GeneX · Cell_ATCG ↓ PCR amplification 5 copies, all identical counted as 5 ✗ WITH UMI 3 original molecules GeneX · Cell · UMI_AAA GeneX · Cell · UMI_BBB GeneX · Cell · UMI_CCC ↓ PCR copies share same UMI collapsed to 1 count each 3 unique UMIs = 3 true count ✓
Figure 3: UMIs prevent PCR duplication inflation. Each original molecule gets a unique tag — duplicates share the same UMI and are collapsed to a count of 1.
ℹ️
snRNA-seq vs scRNA-seq

Some datasets use single-nucleus RNA-seq (snRNA-seq) — nuclei isolated instead of whole cells. Better for hard-to-dissociate tissues like brain. Cytoplasmic RNAs are underrepresented but the analysis pipeline is nearly identical.



Section 03 — Dataset

The PBMC 3k Dataset

3k PBMCs from a Healthy Donor
10x Genomics · Chromium v1 chemistry · Illumina NextSeq 500 · hg19 reference genome

Peripheral Blood Mononuclear Cells from a healthy donor. The canonical teaching dataset — used in the official Scanpy and Seurat tutorials. Small enough to run on any laptop in minutes.

2,700Cells
32,738Genes
~2,638Cells after QC
~2,000HVGs used
→ Download from 10x Genomics
🔬
Understanding the numbers

2,700 cells = rows. Each cell was captured in its own droplet and barcoded uniquely.

32,738 genes = columns. Every gene in the hg19 genome — but ~90% of values are zeros. This sparsity is biology, not a problem: most genes are not expressed in most cells.

2,638 after QC = 62 cells removed (high mito %, too few genes = dying or empty).

~2,000 HVGs = only the most variable genes are used for clustering. The rest add noise.

What cell types to expect in PBMCs

CD4 T cells
IL7R, CCR7, CD3D
CD8 T cells
CD8A, GZMB, NKG7
NK cells
NKG7, GNLY, KLRD1
B cells
MS4A1, CD79A, CD19
CD14 Monocytes
CD14, LYZ, CST3
FCGR3A Monocytes
FCGR3A, MS4A7
Dendritic cells
FCER1A, CST3
Platelets
PPBP, PF4


Section 04 — Pipeline

Upstream: FASTQ → Count Matrix

Cell Ranger already ran for the PBMC 3k dataset. But understanding what happened upstream explains why your matrix looks the way it does.

UPSTREAM — core facility / Cell Ranger
🧬
wet lab
10x Chromium droplet capture

Cells + barcoded beads encapsulated one-per-droplet. mRNA reverse-transcribed and tagged with cell barcode + UMI.

🔬
sequencer
Illumina sequencing → FASTQ

Read 1 (28bp) = cell barcode + UMI. Read 2 (91bp) = cDNA insert. Paired-end output: R1.fastq.gz + R2.fastq.gz.

⚙️
Cell Ranger / STARsolo
Alignment + barcode correction + UMI counting

STAR aligns reads to hg19. Cell Ranger matches barcodes to whitelist, deduplicates UMIs, identifies real cells vs empty droplets via knee plot. Output: matrix.mtx + barcodes.tsv + genes.tsv.

DOWNSTREAM — your laptop starts here ↓
ℹ️
Cell Ranger is your nf-core equivalent

Just like nf-core/rnaseq takes FASTQs and outputs a counts matrix for DESeq2, Cell Ranger takes FASTQs and outputs a sparse matrix for Scanpy. You run it once on the cluster, then do all analysis locally. The open-source alternative is STARsolo — same output format, no license needed.

STARsolo on Xanadu (SBATCH)

BASH starsolo.sh
#!/bin/bash
#SBATCH --job-name=STARsolo
#SBATCH --ntasks=8
#SBATCH --mem=64G

module load STAR/2.7.10a

STAR --runThreadN 8 \
  --genomeDir genome_index/ \
  --readFilesIn sample_R2.fastq.gz sample_R1.fastq.gz \
  --readFilesCommand zcat \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist 10x_barcodes_v3.txt \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 12 \
  --soloStrand Forward \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix output/sample_


Section 05 — Pipeline

The Count Matrix & AnnData

The count matrix is the output of Cell Ranger and the input to Scanpy. Understanding its structure is essential.

barcodes.tsv cell IDs genes.tsv gene symbols matrix.mtx sparse counts ANNDATA OBJECT adata.X 2700 cells × 32738 genes sparse · ~90% zeros adata.obs cell metadata (rows) n_genes, mito%, cluster adata.var gene metadata (cols) highly_variable, n_cells adata.obsm embeddings X_pca · X_umap · spatial
Figure 4: Three Cell Ranger output files load into an AnnData object. The structure is identical to the spatial proteomics tutorial — just no spatial coordinates.
ℹ️
Why ~90% zeros?

Sparsity arises from two sources. Biological: most genes are cell-type specific — insulin is only expressed in β-cells, HBB only in red blood cells. Technical: low-abundance transcripts may not get captured during library prep (dropout). Both contribute. Sparse matrix formats (.mtx) store only non-zero values, keeping file sizes manageable.



Section 06 — Setup

Environment Setup

BASHsetup.sh
conda create -n scrna python=3.10 -y
conda activate scrna

pip install scanpy anndata
pip install pandas numpy matplotlib seaborn
pip install leidenalg igraph
pip install jupyter

jupyter notebook


Section 07 — Pipeline

Load the Data

PYTHON01_load.py
import scanpy as sc
import numpy as np

sc.settings.verbosity = 3
sc.set_figure_params(dpi=80, facecolor="white")

# ── Option A: auto-download (easiest) ────────────────────────────
adata = sc.datasets.pbmc3k()

# ── Option B: load from Cell Ranger output ────────────────────────
adata = sc.read_10x_mtx(
    "data/filtered_gene_bc_matrices/hg19/",
    var_names="gene_symbols",
    cache=True
)
adata.var_names_make_unique()

print(adata)
# AnnData object with n_obs × n_vars = 2700 × 32738


Section 08 — Pipeline

The Mathematical Goal

Before diving into code, it helps to understand the mathematical goal of the entire pipeline.

We start with a sparse, high-dimensional count matrix — 2,700 cells × 32,738 genes. Each cell is a point in 32,738-dimensional space. Our goal is to transform this into a low-dimensional representation where distance reflects biological similarity. Cells of the same type should cluster together; different types should be far apart. Everything — normalization, HVG selection, PCA, UMAP, Leiden — serves this single objective.

🧠
The transformation pipeline in one sentence

Raw counts → correct for technical noise (QC, normalization) → reduce dimensions (HVG, PCA) → embed in 2D (UMAP) → find communities (Leiden) → assign biology (annotation). Each step removes noise and preserves signal.

⚠️
UMAP distances are not absolute

UMAP is a visualization tool, not a quantitative embedding. The distance between two clusters on a UMAP does not reflect how biologically different they are. Two clusters far apart on UMAP may be more similar than two clusters that appear close. Use UMAP to see structure — use quantitative methods (DEG, pseudotime) to measure it.



Section 09 — Pipeline

QC & Filtering

QC removes low-quality cells. Three metrics identify problems. This step is unique to scRNA-seq — IMC doesn't need it because tissue is fixed, not dissociated.

⚠️
Thresholds are dataset-dependent — always look at distributions first

The values below (200/2500/5%) are reasonable starting points for PBMC data. Tumor cells legitimately have more genes. Brain cells have higher baseline mito%. Always inspect violin plots before filtering — cutting too aggressively removes real rare cell populations.

n_genes_by_counts
Filter: <200 or >2500

Too few = empty droplet or dead cell. Too many = doublet (two cells captured together).

total_counts
Filter: outliers

Total UMIs per cell. Correlates with n_genes. Extremes = doublets or poor quality.

pct_counts_mt
Filter: >5%

% mitochondrial genes. High = dying cell. Cytoplasmic RNA leaks out; mt-RNA stays. No equivalent in IMC.

Healthy cell n_genes: 500–2000 mito%: <5% ✓ keep Dying cell n_genes: <200 mito%: >20% cytoplasm leaks out ✗ remove Doublet n_genes: >2500 2 cells in 1 droplet ✗ remove
Figure 5: Three cell fates in QC. Healthy cells pass all thresholds. Dying cells show high mito% (cytoplasm leaks out, mitochondria stay). Doublets show abnormally high gene counts.
PYTHON02_qc.py
# ── Flag mitochondrial genes ──────────────────────────────────────
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# ── Calculate QC metrics ──────────────────────────────────────────
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.violin(adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4, multi_panel=True, save="_qc.pdf"
)

# ── Filter ────────────────────────────────────────────────────────
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=2500)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.pct_counts_mt < 5]

print(f"After QC: {adata.n_obs} cells")   # ~2638
ℹ️
Doublet detection — go beyond n_genes threshold

The n_genes >2500 filter catches obvious doublets but misses subtle ones. For rigorous analysis, use a computational doublet detector: pip install scrublet — it simulates artificial doublets and scores each cell for doublet probability. Recommended for any dataset going to publication.

ℹ️
Cell cycle effects

Actively cycling cells express cell cycle genes (S and G2M phase markers) that can dominate clustering — you end up with clusters defined by cell cycle state rather than cell type. Score cell cycle phases with sc.tl.score_genes_cell_cycle() and optionally regress them out before PCA.


Section 10 — Pipeline

Normalization

Cells are sequenced to different depths. Without correction, deeply-sequenced cells look like they express more of everything — a technical artifact, not biology. Normalization adjusts for sequencing depth, though it does not fully correct compositional bias — if one gene is very highly expressed it can still influence the scaling of others. For most analyses this is acceptable; for sensitive comparisons consider SCTransform.

BEFORE Cell A 10,000 UMIs GeneX: 100 GeneY: 50 Cell B 1,000 UMIs GeneX: 10 GeneY: 5 A looks 10× higher same relative expression! normalize + log1p AFTER Cell A normalized GeneX: 1.21 GeneY: 0.92 Cell B normalized GeneX: 1.21 GeneY: 0.92 same expression = same value cells now comparable ✓
Figure 6: Library size correction then log1p transform makes cells with different sequencing depths directly comparable.
PYTHON03_normalize.py
# ── Store raw counts before transformation ────────────────────────
adata.layers["counts"] = adata.X.copy()
adata.raw = adata          # freeze raw — needed for DEG later

# ── Normalize to 10,000 counts per cell ───────────────────────────
sc.pp.normalize_total(adata, target_sum=1e4)

# ── log1p transform ───────────────────────────────────────────────
# log1p(x) = log(x+1) handles zeros, compresses dynamic range
sc.pp.log1p(adata)

Section 11 — Pipeline

Highly Variable Genes

With 32,738 genes, most add noise. HVGs are genes with higher variance than expected given their mean expression — this mean-variance correction is important because high-count genes naturally show more variance. Simply taking the most variable genes without this correction would always select highly expressed housekeeping genes. Unlike scRNA-seq, IMC measures a predefined protein panel, so feature selection occurs during experimental design rather than analysis.

PYTHON04_hvg.py
# ── Find top 2000 most variable genes ────────────────────────────
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    min_mean=0.0125, max_mean=3, min_disp=0.5,
    flavor="seurat_v3",
    layer="counts"
)
sc.pl.highly_variable_genes(adata, save="_hvg.pdf")

print(f"HVGs: {adata.var.highly_variable.sum()} / {adata.n_vars}")

# ── Scale to unit variance before PCA ────────────────────────────
# Centers each gene mean=0, variance=1
# Clip outliers at 10 std devs
sc.pp.scale(adata, max_value=10)

Section 12 — Pipeline

PCA, UMAP & Clustering

Even 2,000 HVGs is too high-dimensional to cluster directly. PCA reduces to ~50 components, then UMAP projects to 2D for visualization. Leiden finds communities in the neighbor graph.

2000 HVGs high-dimensional PCA 50 PCs major variance axes k-NN Neighbor graph cells linked by similarity UMAP Leiden 2D UMAP + clusters visualize cell populations identify cell types
Figure 7: Dimensionality reduction pipeline. 2000 genes → 50 PCs (linear) → neighbor graph → UMAP (2D) + Leiden clusters.
PYTHON05_clustering.py
# ── PCA — only uses HVGs automatically ───────────────────────────
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)   # check elbow

# ── Neighbor graph ────────────────────────────────────────────────
# n_pcs=40 standard for PBMC; adjust based on elbow plot
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# ── UMAP ──────────────────────────────────────────────────────────
sc.tl.umap(adata, random_state=42)

# ── Leiden clustering ─────────────────────────────────────────────
# flavor="igraph" avoids leidenalg dependency
sc.tl.leiden(adata, resolution=0.5, flavor="igraph",
             n_iterations=2, random_state=0)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.umap(adata,
    color=["leiden", "n_genes_by_counts", "pct_counts_mt"],
    ncols=3, save="_clusters.pdf"
)
⚠️
UMAP is stochastic — always set a seed

Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.

⚠️
Leiden resolution is a hyperparameter — more clusters ≠ better biology

Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Beginners often over-cluster by increasing resolution until every cluster looks distinct. Start at 0.3–0.5, validate clusters biologically before increasing. A cluster with no clear marker genes is probably artificial.

⚠️

Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.

⚠️

Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Start at 0.3–0.5, validate clusters biologically before increasing.


Section 13 — Pipeline

Cell Type Annotation

Leiden clusters are just numbers. You assign biology by finding marker genes per cluster and matching to known cell type signatures.

PYTHON06_annotation.py
# ── Find markers: each cluster vs all others ─────────────────────
sc.tl.rank_genes_groups(
    adata, groupby="leiden", method="wilcoxon", use_raw=True
)
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)

# ── Dotplot of known markers ──────────────────────────────────────
marker_genes = {
    "CD4 T":       ["IL7R", "CCR7"],
    "CD8 T":       ["CD8A", "NKG7"],
    "NK":          ["GNLY", "NKG7"],
    "B cells":     ["MS4A1", "CD79A"],
    "CD14 Mono":   ["CD14", "LYZ"],
    "FCGR3A Mono": ["FCGR3A", "MS4A7"],
    "DC":          ["FCER1A", "CST3"],
    "Platelet":    ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden")

# ── Assign labels based on dotplot ───────────────────────────────
cell_type_map = {
    "0": "CD4 T", "1": "CD14 Mono", "2": "CD4 T",
    "3": "B cells", "4": "CD8 T", "5": "NK",
    "6": "FCGR3A Mono", "7": "DC", "8": "Platelet"
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cell_type_map)
sc.pl.umap(adata, color="cell_type", legend_loc="on data")
ℹ️
vs IMC annotation

IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.

ℹ️

IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.


Section 14 — Pipeline

Differential Expression

PYTHON07_deg.py
# ── DEG: CD8 T vs CD4 T ──────────────────────────────────────────
sc.tl.rank_genes_groups(
    adata, groupby="cell_type",
    groups=["CD8 T"], reference="CD4 T",
    method="wilcoxon", use_raw=True
)

# ── Extract as dataframe ──────────────────────────────────────────
deg_df = sc.get.rank_genes_groups_df(
    adata, group="CD8 T", pval_cutoff=0.05, log2fc_min=0.5
)
print(deg_df.head(20))

# ── Heatmap of top markers ────────────────────────────────────────
sc.pl.rank_genes_groups_heatmap(
    adata, n_genes=10, groupby="cell_type",
    show_gene_labels=True, save="_deg_heatmap.pdf"
)

Section 15 — Advanced

Batch Effects & Correction

PBMC 3k is a single sample so batch effects don't apply here. But in any real multi-sample dataset they are unavoidable — and failing to correct them is one of the most common mistakes in published scRNA-seq analyses.

Batch effects are technical differences between samples processed on different days, by different people, or on different instruments. Cells from batch A may cluster separately from batch B not because they're biologically different, but because of library prep variation. Without correction, you annotate technical clusters as cell types.

How to detect batch effects

Color your UMAP by sample/batch after integration. If cells from different samples don't mix within the same cell type clusters — you have batch effects that need correction.

Correction tools

PYTHONbatch_correction.py
# ── Option 1: Harmony (fast, widely used) ─────────────────────────
import harmonypy as hm
import scanpy as sc

sc.tl.pca(adata)
ho = hm.run_harmony(adata.obsm["X_pca"], adata.obs, "batch")
adata.obsm["X_pca_harmony"] = ho.Z_corr.T

# Use harmony embedding for neighbors/UMAP instead of raw PCA
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)

# ── Option 2: scVI (deep learning, more powerful) ─────────────────
import scvi
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()

# ── Option 3: BBKNN (fast graph-based) ───────────────────────────
import bbknn
bbknn.bbknn(adata, batch_key="batch")  # replaces sc.pp.neighbors()
💡
Which method to use?

Harmony — start here. Fast, easy, works well for most cases. scVI — better for large datasets or strong batch effects, but slower. BBKNN — good for datasets where you want to preserve more biological variation. All three are published and widely used.

💡

Harmony — start here. Fast, easy, works well for most cases. scVI — better for large datasets or strong batch effects, but slower. BBKNN — good for datasets where you want to preserve more biological variation. All three are published and widely used.


Section 16 — Advanced

Common Pitfalls

These mistakes appear in published papers. Knowing them makes you a better analyst and reviewer.

Over-filtering in QC
Risk: lose rare cell types

Setting mito% cutoff too low or n_genes too strict removes rare populations. A cell type with 500 cells in 50k is 1% — easy to lose. Always check what you filtered out.

Too many PCs
Risk: noise dominates

Using n_pcs=50 when the elbow is at 15 means you're feeding noise into the neighbor graph. Check the variance ratio plot and pick PCs at the elbow.

Over-clustering
Risk: artificial cell types

Increasing resolution until clusters look distinct. Every artificial cluster will have some differentially expressed genes by chance. Validate with known biology before reporting.

Ignoring batch effects
Risk: technical clusters

In multi-sample data, uncorrected batch effects create clusters of "Sample A cells" and "Sample B cells" that get annotated as cell types. Always color UMAP by sample.

Interpreting UMAP distances
Risk: wrong conclusions

Proximity on UMAP does not equal biological similarity. Clusters that look close may be more different than clusters far apart. UMAP is for visualization only — not measurement.

No reproducibility seeds
Risk: unreproducible figures

UMAP and Leiden are stochastic. Without random_state, your figures change every run. Set seeds before any stochastic step — essential for shared or published analysis.


Section 17 — Output

Publication Figures

🔵
Annotated UMAP
All cells colored by cell type
🎻
QC violins
n_genes, UMIs, mito% per sample
🟡
Dotplot
Marker gene expression per cluster
🔥
DEG heatmap
Top markers across all clusters
📊
Composition bar
% each cell type per sample
🌋
Volcano plot
log2FC vs -log10(p)
CORE ANALYSIS (THIS TUTORIAL) Standard QC & Filtering Normalization Clustering Annotation Find markers Label clusters Validate Comparison Diff expression Cell type DEG GO enrichment Visualization UMAP Heatmaps Dot plots ADVANCED ANALYSIS CellChat Ligand-receptor Cell communication Network analysis Trajectory Pseudotime Development State transitions Batch correction Harmony scVI Integration Biological Insights publication · hypotheses · discovery
Figure 8: Complete scRNA-seq analysis workflow. Core analysis (this tutorial) plus advanced downstream options.
🎯
Key differences from spatial proteomics (IMC)

scRNA-seq adds three steps IMC doesn't need: (1) mito QC — cells can die during dissociation; (2) HVG selection — 32k genes need filtering, but IMC feature selection happens at experimental design time when the antibody panel is chosen; (3) richer DEG — transcriptome-wide coverage. IMC adds one thing scRNA-seq can't do: spatial neighborhood analysis — where cells actually sit in tissue.