Section 01 — Concepts

What is Spatial Proteomics / IMC?

Imaging Mass Cytometry (IMC) uses metal-tagged antibodies and a laser to simultaneously measure 40+ proteins per cell while preserving tissue architecture. Every cell gets an X/Y coordinate plus a protein expression vector — the combination is what makes it "spatial."

Think of it like scRNA-seq but instead of sequencing reads, you're measuring metal ion intensities from laser ablation. And instead of losing spatial context when you dissociate cells, every cell stays pinned to its location in the tissue.
Tissue section laser ablation Mass spec metal ion detection Cell x Protein matrix + X/Y coordinates CK5 CD45 HER2 X · Y Cell_1 3.2 0.1 0.4 124·88 Cell_2 0.2 4.8 0.1 256·142 Cell_3 0.1 0.3 6.1 310·98 ~700K cells x 38 proteins + spatial coords
Figure 1 — IMC workflow. Tissue stained with metal-tagged antibodies. Laser ablates spot by spot. Mass spec detects metal ions per spot. Output: cell × protein matrix with X/Y coordinates for every cell.

IMC vs scRNA-seq — key differences

FeaturescRNA-seqIMC / Spatial Proteomics
InputDissociated cellsIntact tissue section
MeasurementRNA transcriptsProtein intensity (metal tags)
Feature count~20,000 genes~40 proteins (panel-defined)
Spatial context✗ Lost on dissociation✓ X/Y coordinates preserved
Cell proximity✗ Inferred only✓ Directly measured
NormalizationLibrary size → log1parcsinh with cofactor
QCMito%, n_genes, UMIsCell area, signal quality
AnnotationComputational (marker genes)Manual (known protein panel)

Section 02 — Dataset

The Jackson 2020 Dataset

The Single-Cell Pathology Landscape of Breast Cancer
Jackson, Fischer et al. · Nature 578:615–620 · 2020 · Bodenmiller Lab, University of Zurich

IMC profiling of breast cancer tissue microarrays. Pre-segmented cell × protein matrices and spatial coordinates are publicly available on Zenodo. The gold-standard public IMC dataset — used as the canonical benchmark in Squidpy, CATALYST, and steinbock tutorials.

352
Patients
~700K
Cells total
38
Proteins
~280
Tumor cores (ROIs)
→ Zenodo DOI: 10.5281/zenodo.3518284

Key proteins in the panel

CK5
Basal epithelium
CK8/18
Luminal epithelium
HER2
Tumor subtype
ER
Tumor subtype
CD45
Pan-immune
CD8a
Cytotoxic T cells
CD68
Macrophages
CD20
B cells
Vimentin
Stromal / EMT
SMA
Smooth muscle
Ki67
Proliferation
p53
Tumor suppressor
💡
Why this dataset is ideal for learning spatial proteomics

Large enough to be realistic, rich biological signal (tumor/immune/stromal interactions), comes pre-segmented so you skip the hardest step, and is the canonical benchmark for spatial proteomics tools. Every tutorial from Squidpy, CATALYST, and steinbock uses it.


Section 03 — Concepts

Upstream Pipeline (for reference)

You're skipping this for the Jackson dataset — you download pre-segmented data directly from Zenodo. But understanding what happened upstream explains the data structure you receive.

Staining metal-tagged Ab wet lab IMC acquisition Hyperion → .mcd Fluidigm Hyperion Image export readimc → TIFFs readimc / steinbock Cell segmentation Mesmer → masks GPU cluster CORE FACILITY ↑     YOUR LAPTOP ↓ Feature extraction mean intensity per cell steinbock Spillover correction channel bleed matrix CATALYST (R) Your starting point sc_dat.csv — cell x protein counts sc_meta.csv — X/Y + patient ID downloaded from Zenodo for Jackson data
Figure 2 — IMC upstream pipeline. The Jackson dataset skips everything above the dashed line — you download pre-extracted feature matrices directly from Zenodo.
⚠️
Cell segmentation is the hardest and most consequential step

Poor segmentation propagates through everything downstream. A cell boundary that bleeds into a neighbor contaminates protein measurements. Jackson et al. used CellProfiler — modern pipelines use Mesmer (deep learning), which performs better in crowded tissue. For your own data, always visually inspect segmentation masks on a few ROIs before proceeding.


Section 04 — Setup

Environment Setup

Bashsetup.sh
conda create -n spatial python=3.10 -y
conda activate spatial

pip install squidpy scanpy anndata
pip install pandas numpy matplotlib seaborn
pip install leidenalg igraph
pip install jupyter

jupyter notebook
ℹ️
Squidpy vs Scanpy — what each does

Scanpy handles the single-cell analysis: QC, normalization, PCA, UMAP, Leiden clustering. This is identical to the scRNA-seq workflow. Squidpy adds the spatial layer on top: neighborhood enrichment, co-occurrence, spatial statistics. You need both — Squidpy is built on Scanpy's AnnData format.


Section 05 — Pipeline

Load the Pre-segmented Data

The Jackson dataset comes as pre-extracted CSV files. You build an AnnData object — the same format used in scRNA-seq with Scanpy, extended with a obsm["spatial"] slot for X/Y coordinates.

sc_dat.csv cell x protein counts sc_meta.csv X, Y, patient, ROI ANNDATA OBJECT adata.X 700K cells x 38 proteins adata.obs patient, ROI, cell_type adata.obsm ["spatial"] → X, Y coords adata.var protein names (38)
Figure 3 — AnnData structure for IMC. Identical to scRNA-seq except adata.X holds protein intensities and adata.obsm["spatial"] stores the X/Y tissue coordinates that make it spatial.
Python01_load.py
import scanpy as sc
import squidpy as sq
import pandas as pd
import numpy as np

# ── Download from Zenodo ─────────────────────────────────────────
# doi.org/10.5281/zenodo.3518284
# Files: SingleCell_and_Metadata.zip → sc_dat.csv + sc_meta.csv

# ── Load counts and metadata ─────────────────────────────────────
counts = pd.read_csv("sc_dat.csv", index_col=0)
meta   = pd.read_csv("sc_meta.csv", index_col=0)

# ── Build AnnData object ─────────────────────────────────────────
adata = sc.AnnData(
    X=counts.values,
    obs=meta,
    var=pd.DataFrame(index=counts.columns)
)
adata.obsm["spatial"] = meta[["X", "Y"]].values

print(adata)
# AnnData n_obs x n_vars = 699188 x 38

Section 06 — Pipeline

QC & Filtering

IMC QC is simpler than scRNA-seq — no mitochondrial gene filter needed. The main concerns are cells that are too small (segmentation artifacts), too large (merged doublets), or have very low total signal (empty masks or debris).

Cell area
Filter: <10 or >500 px

Too small = debris or segmentation artifact. Too large = two cells merged into one mask (spatial doublet).

DNA signal
Filter: low Ir191/Ir193

Jackson panel includes two DNA intercalators. Cells with near-zero DNA signal are empty masks or background.

Total signal
Filter: extreme outliers

Sum of all channel intensities. Extreme values = ablation artifacts or staining clumps not caught by the area filter.

Python02_qc.py
# ── Compute QC metrics ───────────────────────────────────────────
adata.obs["n_counts"] = adata.X.sum(axis=1)
adata.obs["area"]    = meta["Area"]

# ── Visualize distributions before filtering ─────────────────────
sc.pl.violin(adata, ["area", "n_counts"], jitter=0.4, multi_panel=True)

# ── Filter ───────────────────────────────────────────────────────
adata = adata[
    (adata.obs["area"] > 10) &
    (adata.obs["area"] < 500)
].copy()

print(f"After QC: {adata.n_obs} cells")
⚠️
Thresholds are dataset-dependent — always look at distributions first

The Jackson dataset is already reasonably clean. For your own IMC data, plot area and signal distributions before setting any thresholds. Different tissues and staining protocols produce very different quality profiles.


Section 07 — Pipeline

Normalization

IMC data is not normalized like RNA-seq. The standard is arcsinh transformation with a cofactor (typically 1 for IMC, 5 for CyTOF). This compresses the dynamic range while preserving the zero-inflation structure of protein data.

Unlike RNA-seq where you normalize for sequencing depth, IMC intensities are already comparable across cells — each measurement is an absolute ion count per ablation spot. The arcsinh transform handles the extreme right-skew of protein distributions: a handful of highly stained cells shouldn't dominate the embedding.
Python03_normalize.py
# ── arcsinh transformation (cofactor = 1 for IMC) ────────────────
# cofactor = 5 for CyTOF/flow, 1 for IMC (lower signal range)
cofactor = 1
adata.X = np.arcsinh(adata.X / cofactor)

# ── Store raw for reference ──────────────────────────────────────
adata.layers["raw"] = adata.X.copy()

# ── Subset to cell-type markers (exclude segmentation channels) ───
# Remove DNA1, DNA2, HistoneH3 — used for segmentation, not biology
marker_cols = [c for c in adata.var_names
               if c not in ["DNA1", "DNA2", "HistoneH3"]]
adata_markers = adata[:, marker_cols].copy()
ℹ️
Why arcsinh and not log1p?

Log1p is designed for count data (integers, many zeros). IMC intensities are continuous non-negative real numbers. Arcsinh is the standard for flow and mass cytometry — it behaves like log for large values but is linear near zero, which avoids distorting the heavily zero-inflated distribution.


Section 08 — Pipeline

Dimensionality Reduction & Clustering

With only 38 proteins (vs 20,000 genes in scRNA-seq), PCA is fast and you need far fewer components. The rest of the pipeline — neighbor graph, UMAP, Leiden — is identical to scRNA-seq.

38 proteins arcsinh PCA 15 PCs fewer than scRNA k-NN Neighbor graph protein similarity UMAP Leiden UMAP + clusters cell populations identified
Figure 4 — Dimensionality reduction pipeline for IMC. Identical to scRNA-seq except 15 PCs is sufficient (38 proteins vs ~2000 HVGs in scRNA-seq).
Python04_cluster.py
# ── PCA ──────────────────────────────────────────────────────────
sc.tl.pca(adata_markers, n_comps=15)   # fewer than scRNA (38 proteins)

# ── Neighbor graph ───────────────────────────────────────────────
sc.pp.neighbors(adata_markers, n_neighbors=10, n_pcs=15)

# ── UMAP ─────────────────────────────────────────────────────────
sc.tl.umap(adata_markers, random_state=42)

# ── Leiden clustering ────────────────────────────────────────────
sc.tl.leiden(adata_markers, resolution=0.5, random_state=0)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.umap(adata_markers, color=["leiden", "CK5", "CD45"], ncols=3)

Section 09 — Pipeline

Cell Type Annotation

With a predefined protein panel, annotation is more manual than scRNA-seq. You know exactly which proteins distinguish each cell type — there are no surprises from transcriptome-wide discovery.

Python05_annotate.py
# ── Dotplot: defining markers per cluster ────────────────────────
marker_dict = {
    "Epithelial":  ["CK5", "CK8", "CK18", "EpCAM"],
    "Tumor":       ["HER2", "ER", "Ki67", "p53"],
    "T cells":     ["CD3", "CD4", "CD8a"],
    "Macrophages": ["CD68", "CD163"],
    "B cells":     ["CD20"],
    "Stromal":     ["Vimentin", "SMA", "FAP"],
}
sc.pl.dotplot(adata_markers, marker_dict,
              groupby="leiden", standard_scale="var")

# ── Assign cell type labels ───────────────────────────────────────
cell_type_map = {
    "0": "Tumor",     "1": "CD8 T",
    "2": "Stromal",   "3": "CD4 T",
    "4": "Macrophage", "5": "Epithelial",
    "6": "B cell",    "7": "Endothelial",
}
adata_markers.obs["cell_type"] = (
    adata_markers.obs["leiden"].map(cell_type_map)
)
sc.pl.umap(adata_markers, color="cell_type", legend_loc="on data")
💡
Watch out for mixed-lineage clusters

A cluster expressing both CK5 and CD45 is likely a segmentation error — a tumor cell mask bleeding into a nearby immune cell — rather than a real biology. Always sanity-check clusters that co-express markers from different lineages and consider removing or re-segmenting them.


Section 10 — Spatial Analysis

Spatial Analysis with Squidpy

This is the part that makes IMC unique. With X/Y coordinates for every cell you can ask: which cell types cluster together in tissue? Which pairs are spatially co-enriched beyond what random placement would predict?

Tissue ROI neighborhood radius Neighborhood enrichment Tumor Strom. T cell B cell Tumor Strom. T cell Co-occurrence score distance (μm)
Figure 5 — Squidpy spatial analysis. Left: tissue with cell types and neighborhood radius. Middle: neighborhood enrichment heatmap (green = co-enriched beyond chance). Right: co-occurrence score vs distance — how enrichment changes as you expand the radius.
Python06_spatial.py
# ── Build spatial neighborhood graph ─────────────────────────────
sq.gr.spatial_neighbors(adata_markers, coord_type="generic", delaunay=True)

# ── Neighborhood enrichment ───────────────────────────────────────
# Which cell type pairs are spatially co-enriched beyond chance?
sq.gr.nhood_enrichment(adata_markers, cluster_key="cell_type")
sq.pl.nhood_enrichment(adata_markers, cluster_key="cell_type",
    method="average", cmap="RdBu_r")

# ── Co-occurrence score ───────────────────────────────────────────
# How does spatial enrichment change as you increase radius?
sq.gr.co_occurrence(adata_markers, cluster_key="cell_type")
sq.pl.co_occurrence(adata_markers, cluster_key="cell_type",
    clusters="Tumor")

# ── Ripley's L: is a cell type spatially self-clustered? ──────────
sq.gr.ripley(adata_markers, cluster_key="cell_type", mode="L")
sq.pl.ripley(adata_markers, cluster_key="cell_type", mode="L")
ℹ️
Interpreting neighborhood enrichment scores

Positive = two cell types are neighbors more often than expected by chance. Zero = random. Negative = spatial avoidance. In breast cancer you typically see strong Tumor–Stromal co-enrichment and variable Tumor–T cell proximity depending on immune infiltration pattern.


Section 11 — Spatial Analysis

Cell–Cell Interaction Analysis

Squidpy's interaction matrix counts how often each pair of cell types are direct neighbors in the spatial graph — a more stringent proximity test than neighborhood enrichment.

Python07_interaction.py
# ── Interaction matrix ────────────────────────────────────────────
# Counts direct neighbor pairs in the Delaunay graph
sq.gr.interaction_matrix(adata_markers, cluster_key="cell_type")
sq.pl.interaction_matrix(adata_markers, cluster_key="cell_type")

# ── Centrality scores ─────────────────────────────────────────────
# Which cell types are most "central" in the spatial network?
sq.gr.centrality_scores(adata_markers, cluster_key="cell_type")
sq.pl.centrality_scores(adata_markers, cluster_key="cell_type")

# ── Combine with L-R analysis (optional) ────────────────────────
# Pass spatial neighbor graph to CellChat or NicheNet as
# an adjacency filter — only infer interactions between
# spatially adjacent cells rather than all cell type pairs
⚠️
Spatial proximity ≠ functional interaction

Just because two cell types are neighbors in the tissue does not mean they're signaling to each other. Proximity is necessary but not sufficient for interaction. Use spatial adjacency to filter and prioritize L-R predictions, but always validate with functional data.


Section 12 — Output

Publication Figures

A typical spatial proteomics paper figure set includes these panels:

🔳
Tissue image
Composite channel overlay on ROI
🔵
Cell type UMAP
Cells colored by annotated type
🏠
Spatial map
Cell types overlaid on tissue X/Y
📈
Nhood enrichment
Cell type co-enrichment heatmap
📊
Co-occurrence
Enrichment score vs distance curve
🆕
Dotplot
Protein expression per cell type
📊
Composition bar
% each cell type per patient/ROI
Interaction matrix
Direct neighbor pair counts
🔢
Survival / clinical
Cell type abundance vs outcome
🎯
The figure reviewers expect that scRNA-seq can't show

Always include a spatial map figure — cell type annotations overlaid on actual tissue coordinates for at least one representative ROI. This is the visual that proves your analysis is spatial and not just standard scRNA-seq on protein data. Show a high-TME and low-TME patient side by side if possible — the difference in spatial organization is striking and immediately convincing.