Analysis from Scratch
A complete guide from what a droplet actually is to a fully annotated UMAP with DEG analysis. Uses the PBMC 3k dataset. Every decision explained.
What is scRNA-seq?
Single-cell RNA sequencing measures gene expression in thousands of individual cells simultaneously. Unlike bulk RNA-seq which gives an average across all cells, scRNA-seq resolves cellular heterogeneity — you identify distinct cell types, rare populations, and transcriptional states within the same sample.
Bulk RNA-seq is like blending a fruit salad and measuring the average flavor. scRNA-seq is like tasting each piece individually — you find out there are strawberries, grapes, and kiwi that the blended average would never reveal.
| Feature | Bulk RNA-seq | scRNA-seq |
|---|---|---|
| Resolution | Tissue average | Single cell |
| Cell types | Mixed together | Resolved individually |
| Rare populations | Masked | Detectable |
| Sparsity | Low | ~90% zeros |
| Spatial info | ❌ | ❌ (dissociated) |
| Cost per sample | Low | Higher |
Droplets, Barcodes & UMIs
Understanding the technology explains why your data looks the way it does — sparse, with cell barcodes and UMI counts.
The droplet system
Cells flow through a microfluidic chip and are captured one-per-droplet with a barcoded bead. Inside each droplet, RNA is reverse-transcribed and tagged with two identifiers before sequencing.
What is a cell barcode?
A short DNA sequence (16–20bp) unique to each bead. After sequencing, every read carries this barcode — so you know which cell it came from. 10x Chromium uses a whitelist of ~6,700 known valid barcodes for error correction.
What is a UMI?
A random 10–12bp sequence that tags each individual mRNA molecule before PCR amplification. This solves a critical problem:
Some datasets use single-nucleus RNA-seq (snRNA-seq) — nuclei isolated instead of whole cells. Better for hard-to-dissociate tissues like brain. Cytoplasmic RNAs are underrepresented but the analysis pipeline is nearly identical.
The PBMC 3k Dataset
Peripheral Blood Mononuclear Cells from a healthy donor. The canonical teaching dataset — used in the official Scanpy and Seurat tutorials. Small enough to run on any laptop in minutes.
2,700 cells = rows. Each cell was captured in its own droplet and barcoded uniquely.
32,738 genes = columns. Every gene in the hg19 genome — but ~90% of values are zeros. This sparsity is biology, not a problem: most genes are not expressed in most cells.
2,638 after QC = 62 cells removed (high mito %, too few genes = dying or empty).
~2,000 HVGs = only the most variable genes are used for clustering. The rest add noise.
What cell types to expect in PBMCs
Upstream: FASTQ → Count Matrix
Cell Ranger already ran for the PBMC 3k dataset. But understanding what happened upstream explains why your matrix looks the way it does.
UPSTREAM — core facility / Cell RangerCells + barcoded beads encapsulated one-per-droplet. mRNA reverse-transcribed and tagged with cell barcode + UMI.
Read 1 (28bp) = cell barcode + UMI. Read 2 (91bp) = cDNA insert. Paired-end output: R1.fastq.gz + R2.fastq.gz.
STAR aligns reads to hg19. Cell Ranger matches barcodes to whitelist, deduplicates UMIs, identifies real cells vs empty droplets via knee plot. Output: matrix.mtx + barcodes.tsv + genes.tsv.
Just like nf-core/rnaseq takes FASTQs and outputs a counts matrix for DESeq2, Cell Ranger takes FASTQs and outputs a sparse matrix for Scanpy. You run it once on the cluster, then do all analysis locally. The open-source alternative is STARsolo — same output format, no license needed.
STARsolo on Xanadu (SBATCH)
#!/bin/bash #SBATCH --job-name=STARsolo #SBATCH --ntasks=8 #SBATCH --mem=64G module load STAR/2.7.10a STAR --runThreadN 8 \ --genomeDir genome_index/ \ --readFilesIn sample_R2.fastq.gz sample_R1.fastq.gz \ --readFilesCommand zcat \ --soloType CB_UMI_Simple \ --soloCBwhitelist 10x_barcodes_v3.txt \ --soloCBstart 1 --soloCBlen 16 \ --soloUMIstart 17 --soloUMIlen 12 \ --soloStrand Forward \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix output/sample_
The Count Matrix & AnnData
The count matrix is the output of Cell Ranger and the input to Scanpy. Understanding its structure is essential.
Sparsity arises from two sources. Biological: most genes are cell-type specific — insulin is only expressed in β-cells, HBB only in red blood cells. Technical: low-abundance transcripts may not get captured during library prep (dropout). Both contribute. Sparse matrix formats (.mtx) store only non-zero values, keeping file sizes manageable.
Environment Setup
conda create -n scrna python=3.10 -y conda activate scrna pip install scanpy anndata pip install pandas numpy matplotlib seaborn pip install leidenalg igraph pip install jupyter jupyter notebook
Load the Data
import scanpy as sc import numpy as np sc.settings.verbosity = 3 sc.set_figure_params(dpi=80, facecolor="white") # ── Option A: auto-download (easiest) ──────────────────────────── adata = sc.datasets.pbmc3k() # ── Option B: load from Cell Ranger output ──────────────────────── adata = sc.read_10x_mtx( "data/filtered_gene_bc_matrices/hg19/", var_names="gene_symbols", cache=True ) adata.var_names_make_unique() print(adata) # AnnData object with n_obs × n_vars = 2700 × 32738
The Mathematical Goal
Before diving into code, it helps to understand the mathematical goal of the entire pipeline.
We start with a sparse, high-dimensional count matrix — 2,700 cells × 32,738 genes. Each cell is a point in 32,738-dimensional space. Our goal is to transform this into a low-dimensional representation where distance reflects biological similarity. Cells of the same type should cluster together; different types should be far apart. Everything — normalization, HVG selection, PCA, UMAP, Leiden — serves this single objective.
Raw counts → correct for technical noise (QC, normalization) → reduce dimensions (HVG, PCA) → embed in 2D (UMAP) → find communities (Leiden) → assign biology (annotation). Each step removes noise and preserves signal.
UMAP is a visualization tool, not a quantitative embedding. The distance between two clusters on a UMAP does not reflect how biologically different they are. Two clusters far apart on UMAP may be more similar than two clusters that appear close. Use UMAP to see structure — use quantitative methods (DEG, pseudotime) to measure it.
QC & Filtering
QC removes low-quality cells. Three metrics identify problems. This step is unique to scRNA-seq — IMC doesn't need it because tissue is fixed, not dissociated.
The values below (200/2500/5%) are reasonable starting points for PBMC data. Tumor cells legitimately have more genes. Brain cells have higher baseline mito%. Always inspect violin plots before filtering — cutting too aggressively removes real rare cell populations.
Too few = empty droplet or dead cell. Too many = doublet (two cells captured together).
Total UMIs per cell. Correlates with n_genes. Extremes = doublets or poor quality.
% mitochondrial genes. High = dying cell. Cytoplasmic RNA leaks out; mt-RNA stays. No equivalent in IMC.
# ── Flag mitochondrial genes ────────────────────────────────────── adata.var["mt"] = adata.var_names.str.startswith("MT-") # ── Calculate QC metrics ────────────────────────────────────────── sc.pp.calculate_qc_metrics( adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True ) # ── Visualize ──────────────────────────────────────────────────── sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True, save="_qc.pdf" ) # ── Filter ──────────────────────────────────────────────────────── sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_cells(adata, max_genes=2500) sc.pp.filter_genes(adata, min_cells=3) adata = adata[adata.obs.pct_counts_mt < 5] print(f"After QC: {adata.n_obs} cells") # ~2638
The n_genes >2500 filter catches obvious doublets but misses subtle ones. For rigorous analysis, use a computational doublet detector: pip install scrublet — it simulates artificial doublets and scores each cell for doublet probability. Recommended for any dataset going to publication.
Actively cycling cells express cell cycle genes (S and G2M phase markers) that can dominate clustering — you end up with clusters defined by cell cycle state rather than cell type. Score cell cycle phases with sc.tl.score_genes_cell_cycle() and optionally regress them out before PCA.
Normalization
Cells are sequenced to different depths. Without correction, deeply-sequenced cells look like they express more of everything — a technical artifact, not biology. Normalization adjusts for sequencing depth, though it does not fully correct compositional bias — if one gene is very highly expressed it can still influence the scaling of others. For most analyses this is acceptable; for sensitive comparisons consider SCTransform.
# ── Store raw counts before transformation ──────────────────────── adata.layers["counts"] = adata.X.copy() adata.raw = adata # freeze raw — needed for DEG later # ── Normalize to 10,000 counts per cell ─────────────────────────── sc.pp.normalize_total(adata, target_sum=1e4) # ── log1p transform ─────────────────────────────────────────────── # log1p(x) = log(x+1) handles zeros, compresses dynamic range sc.pp.log1p(adata)
Highly Variable Genes
With 32,738 genes, most add noise. HVGs are genes with higher variance than expected given their mean expression — this mean-variance correction is important because high-count genes naturally show more variance. Simply taking the most variable genes without this correction would always select highly expressed housekeeping genes. Unlike scRNA-seq, IMC measures a predefined protein panel, so feature selection occurs during experimental design rather than analysis.
# ── Find top 2000 most variable genes ──────────────────────────── sc.pp.highly_variable_genes( adata, n_top_genes=2000, min_mean=0.0125, max_mean=3, min_disp=0.5, flavor="seurat_v3", layer="counts" ) sc.pl.highly_variable_genes(adata, save="_hvg.pdf") print(f"HVGs: {adata.var.highly_variable.sum()} / {adata.n_vars}") # ── Scale to unit variance before PCA ──────────────────────────── # Centers each gene mean=0, variance=1 # Clip outliers at 10 std devs sc.pp.scale(adata, max_value=10)
PCA, UMAP & Clustering
Even 2,000 HVGs is too high-dimensional to cluster directly. PCA reduces to ~50 components, then UMAP projects to 2D for visualization. Leiden finds communities in the neighbor graph.
# ── PCA — only uses HVGs automatically ─────────────────────────── sc.tl.pca(adata, svd_solver="arpack", n_comps=50) sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50) # check elbow # ── Neighbor graph ──────────────────────────────────────────────── # n_pcs=40 standard for PBMC; adjust based on elbow plot sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) # ── UMAP ────────────────────────────────────────────────────────── sc.tl.umap(adata, random_state=42) # ── Leiden clustering ───────────────────────────────────────────── # flavor="igraph" avoids leidenalg dependency sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, random_state=0) # ── Visualize ──────────────────────────────────────────────────── sc.pl.umap(adata, color=["leiden", "n_genes_by_counts", "pct_counts_mt"], ncols=3, save="_clusters.pdf" )
Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.
Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Beginners often over-cluster by increasing resolution until every cluster looks distinct. Start at 0.3–0.5, validate clusters biologically before increasing. A cluster with no clear marker genes is probably artificial.
Without random_state=42, UMAP gives a different layout every run. Cluster memberships from Leiden are also stochastic. Set seeds before generating any figure you plan to share or publish.
Resolution controls granularity: higher = more, smaller clusters. There is no "correct" value — it depends on your biological question. Start at 0.3–0.5, validate clusters biologically before increasing.
Cell Type Annotation
Leiden clusters are just numbers. You assign biology by finding marker genes per cluster and matching to known cell type signatures.
# ── Find markers: each cluster vs all others ───────────────────── sc.tl.rank_genes_groups( adata, groupby="leiden", method="wilcoxon", use_raw=True ) sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False) # ── Dotplot of known markers ────────────────────────────────────── marker_genes = { "CD4 T": ["IL7R", "CCR7"], "CD8 T": ["CD8A", "NKG7"], "NK": ["GNLY", "NKG7"], "B cells": ["MS4A1", "CD79A"], "CD14 Mono": ["CD14", "LYZ"], "FCGR3A Mono": ["FCGR3A", "MS4A7"], "DC": ["FCER1A", "CST3"], "Platelet": ["PPBP"], } sc.pl.dotplot(adata, marker_genes, groupby="leiden") # ── Assign labels based on dotplot ─────────────────────────────── cell_type_map = { "0": "CD4 T", "1": "CD14 Mono", "2": "CD4 T", "3": "B cells", "4": "CD8 T", "5": "NK", "6": "FCGR3A Mono", "7": "DC", "8": "Platelet" } adata.obs["cell_type"] = adata.obs["leiden"].map(cell_type_map) sc.pl.umap(adata, color="cell_type", legend_loc="on data")
IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.
IMC annotation uses a dotplot of ~38 known proteins — manual but direct. scRNA-seq uses rank_genes_groups() to discover marker genes computationally, then you interpret them. More steps but richer — you can discover unexpected markers.
Differential Expression
# ── DEG: CD8 T vs CD4 T ────────────────────────────────────────── sc.tl.rank_genes_groups( adata, groupby="cell_type", groups=["CD8 T"], reference="CD4 T", method="wilcoxon", use_raw=True ) # ── Extract as dataframe ────────────────────────────────────────── deg_df = sc.get.rank_genes_groups_df( adata, group="CD8 T", pval_cutoff=0.05, log2fc_min=0.5 ) print(deg_df.head(20)) # ── Heatmap of top markers ──────────────────────────────────────── sc.pl.rank_genes_groups_heatmap( adata, n_genes=10, groupby="cell_type", show_gene_labels=True, save="_deg_heatmap.pdf" )
Batch Effects & Correction
PBMC 3k is a single sample so batch effects don't apply here. But in any real multi-sample dataset they are unavoidable — and failing to correct them is one of the most common mistakes in published scRNA-seq analyses.
Batch effects are technical differences between samples processed on different days, by different people, or on different instruments. Cells from batch A may cluster separately from batch B not because they're biologically different, but because of library prep variation. Without correction, you annotate technical clusters as cell types.
How to detect batch effects
Color your UMAP by sample/batch after integration. If cells from different samples don't mix within the same cell type clusters — you have batch effects that need correction.
Correction tools
# ── Option 1: Harmony (fast, widely used) ───────────────────────── import harmonypy as hm import scanpy as sc sc.tl.pca(adata) ho = hm.run_harmony(adata.obsm["X_pca"], adata.obs, "batch") adata.obsm["X_pca_harmony"] = ho.Z_corr.T # Use harmony embedding for neighbors/UMAP instead of raw PCA sc.pp.neighbors(adata, use_rep="X_pca_harmony") sc.tl.umap(adata) # ── Option 2: scVI (deep learning, more powerful) ───────────────── import scvi scvi.model.SCVI.setup_anndata(adata, batch_key="batch") model = scvi.model.SCVI(adata) model.train() adata.obsm["X_scVI"] = model.get_latent_representation() # ── Option 3: BBKNN (fast graph-based) ─────────────────────────── import bbknn bbknn.bbknn(adata, batch_key="batch") # replaces sc.pp.neighbors()
Harmony — start here. Fast, easy, works well for most cases. scVI — better for large datasets or strong batch effects, but slower. BBKNN — good for datasets where you want to preserve more biological variation. All three are published and widely used.
Harmony — start here. Fast, easy, works well for most cases. scVI — better for large datasets or strong batch effects, but slower. BBKNN — good for datasets where you want to preserve more biological variation. All three are published and widely used.
Common Pitfalls
These mistakes appear in published papers. Knowing them makes you a better analyst and reviewer.
Setting mito% cutoff too low or n_genes too strict removes rare populations. A cell type with 500 cells in 50k is 1% — easy to lose. Always check what you filtered out.
Using n_pcs=50 when the elbow is at 15 means you're feeding noise into the neighbor graph. Check the variance ratio plot and pick PCs at the elbow.
Increasing resolution until clusters look distinct. Every artificial cluster will have some differentially expressed genes by chance. Validate with known biology before reporting.
In multi-sample data, uncorrected batch effects create clusters of "Sample A cells" and "Sample B cells" that get annotated as cell types. Always color UMAP by sample.
Proximity on UMAP does not equal biological similarity. Clusters that look close may be more different than clusters far apart. UMAP is for visualization only — not measurement.
UMAP and Leiden are stochastic. Without random_state, your figures change every run. Set seeds before any stochastic step — essential for shared or published analysis.
Publication Figures
scRNA-seq adds three steps IMC doesn't need: (1) mito QC — cells can die during dissociation; (2) HVG selection — 32k genes need filtering, but IMC feature selection happens at experimental design time when the antibody panel is chosen; (3) richer DEG — transcriptome-wide coverage. IMC adds one thing scRNA-seq can't do: spatial neighborhood analysis — where cells actually sit in tissue.