Spatial Proteomics Tutorial — Jackson IMC Dataset

Section 01 — Concepts

What is Spatial Proteomics / IMC?

Imaging Mass Cytometry (IMC) uses metal-tagged antibodies and a laser to simultaneously measure 40+ proteins per cell while preserving tissue architecture. Every cell gets an X/Y coordinate plus a protein expression vector — the combination is what makes it "spatial."

Think of it like scRNA-seq but instead of sequencing reads, you're measuring metal ion intensities from laser ablation. And instead of losing spatial context when you dissociate cells, every cell stays pinned to its location in the tissue.

Figure 1 — IMC workflow. Tissue stained with metal-tagged antibodies. Laser ablates spot by spot. Mass spec detects metal ions per spot. Output: cell × protein matrix with X/Y coordinates for every cell.

IMC vs scRNA-seq — key differences

Feature	scRNA-seq	IMC / Spatial Proteomics
Input	Dissociated cells	Intact tissue section
Measurement	RNA transcripts	Protein intensity (metal tags)
Feature count	~20,000 genes	~40 proteins (panel-defined)
Spatial context	✗ Lost on dissociation	✓ X/Y coordinates preserved
Cell proximity	✗ Inferred only	✓ Directly measured
Normalization	Library size → log1p	arcsinh with cofactor
QC	Mito%, n_genes, UMIs	Cell area, signal quality
Annotation	Computational (marker genes)	Manual (known protein panel)

Section 02 — Dataset

The Jackson 2020 Dataset

The Single-Cell Pathology Landscape of Breast Cancer

Jackson, Fischer et al. · Nature 578:615–620 · 2020 · Bodenmiller Lab, University of Zurich

IMC profiling of breast cancer tissue microarrays. Pre-segmented cell × protein matrices and spatial coordinates are publicly available on Zenodo. The gold-standard public IMC dataset — used as the canonical benchmark in Squidpy, CATALYST, and steinbock tutorials.

352

Patients

~700K

Cells total

Proteins

~280

Tumor cores (ROIs)

→ Zenodo DOI: 10.5281/zenodo.3518284

Key proteins in the panel

CK5

Basal epithelium

CK8/18

Luminal epithelium

HER2

Tumor subtype

CD45

Pan-immune

CD8a

Cytotoxic T cells

CD68

Macrophages

CD20

B cells

Vimentin

Stromal / EMT

SMA

Smooth muscle

Ki67

Proliferation

p53

Tumor suppressor

💡

Why this dataset is ideal for learning spatial proteomics

Large enough to be realistic, rich biological signal (tumor/immune/stromal interactions), comes pre-segmented so you skip the hardest step, and is the canonical benchmark for spatial proteomics tools. Every tutorial from Squidpy, CATALYST, and steinbock uses it.

Section 03 — Concepts

Upstream Pipeline (for reference)

You're skipping this for the Jackson dataset — you download pre-segmented data directly from Zenodo. But understanding what happened upstream explains the data structure you receive.

Figure 2 — IMC upstream pipeline. The Jackson dataset skips everything above the dashed line — you download pre-extracted feature matrices directly from Zenodo.

⚠️

Cell segmentation is the hardest and most consequential step

Poor segmentation propagates through everything downstream. A cell boundary that bleeds into a neighbor contaminates protein measurements. Jackson et al. used CellProfiler — modern pipelines use Mesmer (deep learning), which performs better in crowded tissue. For your own data, always visually inspect segmentation masks on a few ROIs before proceeding.

Section 04 — Setup

Environment Setup

Bashsetup.sh

conda create -n spatial python=3.10 -y
conda activate spatial

pip install squidpy scanpy anndata
pip install pandas numpy matplotlib seaborn
pip install leidenalg igraph
pip install jupyter

jupyter notebook

ℹ️

Squidpy vs Scanpy — what each does

Scanpy handles the single-cell analysis: QC, normalization, PCA, UMAP, Leiden clustering. This is identical to the scRNA-seq workflow. Squidpy adds the spatial layer on top: neighborhood enrichment, co-occurrence, spatial statistics. You need both — Squidpy is built on Scanpy's AnnData format.

Section 05 — Pipeline

Load the Pre-segmented Data

The Jackson dataset comes as pre-extracted CSV files. You build an AnnData object — the same format used in scRNA-seq with Scanpy, extended with a obsm["spatial"] slot for X/Y coordinates.

Figure 3 — AnnData structure for IMC. Identical to scRNA-seq except adata.X holds protein intensities and adata.obsm["spatial"] stores the X/Y tissue coordinates that make it spatial.

Python01_load.py

import scanpy as sc
import squidpy as sq
import pandas as pd
import numpy as np

# ── Download from Zenodo ─────────────────────────────────────────
# doi.org/10.5281/zenodo.3518284
# Files: SingleCell_and_Metadata.zip → sc_dat.csv + sc_meta.csv

# ── Load counts and metadata ─────────────────────────────────────
counts = pd.read_csv("sc_dat.csv", index_col=0)
meta   = pd.read_csv("sc_meta.csv", index_col=0)

# ── Build AnnData object ─────────────────────────────────────────
adata = sc.AnnData(
    X=counts.values,
    obs=meta,
    var=pd.DataFrame(index=counts.columns)
)
adata.obsm["spatial"] = meta[["X", "Y"]].values

print(adata)
# AnnData n_obs x n_vars = 699188 x 38

Section 06 — Pipeline

QC & Filtering

IMC QC is simpler than scRNA-seq — no mitochondrial gene filter needed. The main concerns are cells that are too small (segmentation artifacts), too large (merged doublets), or have very low total signal (empty masks or debris).

Cell area

Filter: <10 or >500 px

Too small = debris or segmentation artifact. Too large = two cells merged into one mask (spatial doublet).

DNA signal

Filter: low Ir191/Ir193

Jackson panel includes two DNA intercalators. Cells with near-zero DNA signal are empty masks or background.

Total signal

Filter: extreme outliers

Sum of all channel intensities. Extreme values = ablation artifacts or staining clumps not caught by the area filter.

Python02_qc.py

# ── Compute QC metrics ───────────────────────────────────────────
adata.obs["n_counts"] = adata.X.sum(axis=1)
adata.obs["area"]    = meta["Area"]

# ── Visualize distributions before filtering ─────────────────────
sc.pl.violin(adata, ["area", "n_counts"], jitter=0.4, multi_panel=True)

# ── Filter ───────────────────────────────────────────────────────
adata = adata[
    (adata.obs["area"] > 10) &
    (adata.obs["area"] < 500)
].copy()

print(f"After QC: {adata.n_obs} cells")

⚠️

Thresholds are dataset-dependent — always look at distributions first

The Jackson dataset is already reasonably clean. For your own IMC data, plot area and signal distributions before setting any thresholds. Different tissues and staining protocols produce very different quality profiles.

Section 07 — Pipeline

Normalization

IMC data is not normalized like RNA-seq. The standard is arcsinh transformation with a cofactor (typically 1 for IMC, 5 for CyTOF). This compresses the dynamic range while preserving the zero-inflation structure of protein data.

Unlike RNA-seq where you normalize for sequencing depth, IMC intensities are already comparable across cells — each measurement is an absolute ion count per ablation spot. The arcsinh transform handles the extreme right-skew of protein distributions: a handful of highly stained cells shouldn't dominate the embedding.

Python03_normalize.py

# ── arcsinh transformation (cofactor = 1 for IMC) ────────────────
# cofactor = 5 for CyTOF/flow, 1 for IMC (lower signal range)
cofactor = 1
adata.X = np.arcsinh(adata.X / cofactor)

# ── Store raw for reference ──────────────────────────────────────
adata.layers["raw"] = adata.X.copy()

# ── Subset to cell-type markers (exclude segmentation channels) ───
# Remove DNA1, DNA2, HistoneH3 — used for segmentation, not biology
marker_cols = [c for c in adata.var_names
               if c not in ["DNA1", "DNA2", "HistoneH3"]]
adata_markers = adata[:, marker_cols].copy()

ℹ️

Why arcsinh and not log1p?

Log1p is designed for count data (integers, many zeros). IMC intensities are continuous non-negative real numbers. Arcsinh is the standard for flow and mass cytometry — it behaves like log for large values but is linear near zero, which avoids distorting the heavily zero-inflated distribution.

Section 08 — Pipeline

Dimensionality Reduction & Clustering

With only 38 proteins (vs 20,000 genes in scRNA-seq), PCA is fast and you need far fewer components. The rest of the pipeline — neighbor graph, UMAP, Leiden — is identical to scRNA-seq.

Figure 4 — Dimensionality reduction pipeline for IMC. Identical to scRNA-seq except 15 PCs is sufficient (38 proteins vs ~2000 HVGs in scRNA-seq).

Python04_cluster.py

# ── PCA ──────────────────────────────────────────────────────────
sc.tl.pca(adata_markers, n_comps=15)   # fewer than scRNA (38 proteins)

# ── Neighbor graph ───────────────────────────────────────────────
sc.pp.neighbors(adata_markers, n_neighbors=10, n_pcs=15)

# ── UMAP ─────────────────────────────────────────────────────────
sc.tl.umap(adata_markers, random_state=42)

# ── Leiden clustering ────────────────────────────────────────────
sc.tl.leiden(adata_markers, resolution=0.5, random_state=0)

# ── Visualize ────────────────────────────────────────────────────
sc.pl.umap(adata_markers, color=["leiden", "CK5", "CD45"], ncols=3)

Section 09 — Pipeline

Cell Type Annotation

With a predefined protein panel, annotation is more manual than scRNA-seq. You know exactly which proteins distinguish each cell type — there are no surprises from transcriptome-wide discovery.

Python05_annotate.py

# ── Dotplot: defining markers per cluster ────────────────────────
marker_dict = {
    "Epithelial":  ["CK5", "CK8", "CK18", "EpCAM"],
    "Tumor":       ["HER2", "ER", "Ki67", "p53"],
    "T cells":     ["CD3", "CD4", "CD8a"],
    "Macrophages": ["CD68", "CD163"],
    "B cells":     ["CD20"],
    "Stromal":     ["Vimentin", "SMA", "FAP"],
}
sc.pl.dotplot(adata_markers, marker_dict,
              groupby="leiden", standard_scale="var")

# ── Assign cell type labels ───────────────────────────────────────
cell_type_map = {
    "0": "Tumor",     "1": "CD8 T",
    "2": "Stromal",   "3": "CD4 T",
    "4": "Macrophage", "5": "Epithelial",
    "6": "B cell",    "7": "Endothelial",
}
adata_markers.obs["cell_type"] = (
    adata_markers.obs["leiden"].map(cell_type_map)
)
sc.pl.umap(adata_markers, color="cell_type", legend_loc="on data")

💡

Watch out for mixed-lineage clusters

A cluster expressing both CK5 and CD45 is likely a segmentation error — a tumor cell mask bleeding into a nearby immune cell — rather than a real biology. Always sanity-check clusters that co-express markers from different lineages and consider removing or re-segmenting them.

Section 10 — Spatial Analysis

Spatial Analysis with Squidpy

This is the part that makes IMC unique. With X/Y coordinates for every cell you can ask: which cell types cluster together in tissue? Which pairs are spatially co-enriched beyond what random placement would predict?

Figure 5 — Squidpy spatial analysis. Left: tissue with cell types and neighborhood radius. Middle: neighborhood enrichment heatmap (green = co-enriched beyond chance). Right: co-occurrence score vs distance — how enrichment changes as you expand the radius.

Python06_spatial.py

# ── Build spatial neighborhood graph ─────────────────────────────
sq.gr.spatial_neighbors(adata_markers, coord_type="generic", delaunay=True)

# ── Neighborhood enrichment ───────────────────────────────────────
# Which cell type pairs are spatially co-enriched beyond chance?
sq.gr.nhood_enrichment(adata_markers, cluster_key="cell_type")
sq.pl.nhood_enrichment(adata_markers, cluster_key="cell_type",
    method="average", cmap="RdBu_r")

# ── Co-occurrence score ───────────────────────────────────────────
# How does spatial enrichment change as you increase radius?
sq.gr.co_occurrence(adata_markers, cluster_key="cell_type")
sq.pl.co_occurrence(adata_markers, cluster_key="cell_type",
    clusters="Tumor")

# ── Ripley's L: is a cell type spatially self-clustered? ──────────
sq.gr.ripley(adata_markers, cluster_key="cell_type", mode="L")
sq.pl.ripley(adata_markers, cluster_key="cell_type", mode="L")

ℹ️

Interpreting neighborhood enrichment scores

Positive = two cell types are neighbors more often than expected by chance. Zero = random. Negative = spatial avoidance. In breast cancer you typically see strong Tumor–Stromal co-enrichment and variable Tumor–T cell proximity depending on immune infiltration pattern.

Section 11 — Spatial Analysis

Cell–Cell Interaction Analysis

Squidpy's interaction matrix counts how often each pair of cell types are direct neighbors in the spatial graph — a more stringent proximity test than neighborhood enrichment.

Python07_interaction.py

# ── Interaction matrix ────────────────────────────────────────────
# Counts direct neighbor pairs in the Delaunay graph
sq.gr.interaction_matrix(adata_markers, cluster_key="cell_type")
sq.pl.interaction_matrix(adata_markers, cluster_key="cell_type")

# ── Centrality scores ─────────────────────────────────────────────
# Which cell types are most "central" in the spatial network?
sq.gr.centrality_scores(adata_markers, cluster_key="cell_type")
sq.pl.centrality_scores(adata_markers, cluster_key="cell_type")

# ── Combine with L-R analysis (optional) ────────────────────────
# Pass spatial neighbor graph to CellChat or NicheNet as
# an adjacency filter — only infer interactions between
# spatially adjacent cells rather than all cell type pairs

⚠️

Spatial proximity ≠ functional interaction

Just because two cell types are neighbors in the tissue does not mean they're signaling to each other. Proximity is necessary but not sufficient for interaction. Use spatial adjacency to filter and prioritize L-R predictions, but always validate with functional data.

Section 12 — Output

Publication Figures

A typical spatial proteomics paper figure set includes these panels:

🔳

Tissue image

Composite channel overlay on ROI

🔵

Cell type UMAP

Cells colored by annotated type

🏠

Spatial map

Cell types overlaid on tissue X/Y

📈

Nhood enrichment

Cell type co-enrichment heatmap

📊

Co-occurrence

Enrichment score vs distance curve

🆕

Dotplot

Protein expression per cell type

📊

Composition bar

% each cell type per patient/ROI

⛏

Interaction matrix

Direct neighbor pair counts

🔢

Survival / clinical

Cell type abundance vs outcome

🎯

The figure reviewers expect that scRNA-seq can't show

Always include a spatial map figure — cell type annotations overlaid on actual tissue coordinates for at least one representative ROI. This is the visual that proves your analysis is spatial and not just standard scRNA-seq on protein data. Show a high-TME and low-TME patient side by side if possible — the difference in spatial organization is striking and immediately convincing.