Jackson 2020 Breast Cancer
A complete guide to spatial proteomics analysis — from what IMC actually measures to neighborhood analysis and cell–cell interaction. Uses the Jackson 2020 breast cancer dataset. Every decision explained.
What is Spatial Proteomics / IMC?
Imaging Mass Cytometry (IMC) uses metal-tagged antibodies and a laser to simultaneously measure 40+ proteins per cell while preserving tissue architecture. Every cell gets an X/Y coordinate plus a protein expression vector — the combination is what makes it "spatial."
IMC vs scRNA-seq — key differences
| Feature | scRNA-seq | IMC / Spatial Proteomics |
|---|---|---|
| Input | Dissociated cells | Intact tissue section |
| Measurement | RNA transcripts | Protein intensity (metal tags) |
| Feature count | ~20,000 genes | ~40 proteins (panel-defined) |
| Spatial context | ✗ Lost on dissociation | ✓ X/Y coordinates preserved |
| Cell proximity | ✗ Inferred only | ✓ Directly measured |
| Normalization | Library size → log1p | arcsinh with cofactor |
| QC | Mito%, n_genes, UMIs | Cell area, signal quality |
| Annotation | Computational (marker genes) | Manual (known protein panel) |
The Jackson 2020 Dataset
IMC profiling of breast cancer tissue microarrays. Pre-segmented cell × protein matrices and spatial coordinates are publicly available on Zenodo. The gold-standard public IMC dataset — used as the canonical benchmark in Squidpy, CATALYST, and steinbock tutorials.
Key proteins in the panel
Large enough to be realistic, rich biological signal (tumor/immune/stromal interactions), comes pre-segmented so you skip the hardest step, and is the canonical benchmark for spatial proteomics tools. Every tutorial from Squidpy, CATALYST, and steinbock uses it.
Upstream Pipeline (for reference)
You're skipping this for the Jackson dataset — you download pre-segmented data directly from Zenodo. But understanding what happened upstream explains the data structure you receive.
Poor segmentation propagates through everything downstream. A cell boundary that bleeds into a neighbor contaminates protein measurements. Jackson et al. used CellProfiler — modern pipelines use Mesmer (deep learning), which performs better in crowded tissue. For your own data, always visually inspect segmentation masks on a few ROIs before proceeding.
Environment Setup
conda create -n spatial python=3.10 -y conda activate spatial pip install squidpy scanpy anndata pip install pandas numpy matplotlib seaborn pip install leidenalg igraph pip install jupyter jupyter notebook
Scanpy handles the single-cell analysis: QC, normalization, PCA, UMAP, Leiden clustering. This is identical to the scRNA-seq workflow. Squidpy adds the spatial layer on top: neighborhood enrichment, co-occurrence, spatial statistics. You need both — Squidpy is built on Scanpy's AnnData format.
Load the Pre-segmented Data
The Jackson dataset comes as pre-extracted CSV files. You build an AnnData object — the same format used in scRNA-seq with Scanpy, extended with a obsm["spatial"] slot for X/Y coordinates.
import scanpy as sc import squidpy as sq import pandas as pd import numpy as np # ── Download from Zenodo ───────────────────────────────────────── # doi.org/10.5281/zenodo.3518284 # Files: SingleCell_and_Metadata.zip → sc_dat.csv + sc_meta.csv # ── Load counts and metadata ───────────────────────────────────── counts = pd.read_csv("sc_dat.csv", index_col=0) meta = pd.read_csv("sc_meta.csv", index_col=0) # ── Build AnnData object ───────────────────────────────────────── adata = sc.AnnData( X=counts.values, obs=meta, var=pd.DataFrame(index=counts.columns) ) adata.obsm["spatial"] = meta[["X", "Y"]].values print(adata) # AnnData n_obs x n_vars = 699188 x 38
QC & Filtering
IMC QC is simpler than scRNA-seq — no mitochondrial gene filter needed. The main concerns are cells that are too small (segmentation artifacts), too large (merged doublets), or have very low total signal (empty masks or debris).
Too small = debris or segmentation artifact. Too large = two cells merged into one mask (spatial doublet).
Jackson panel includes two DNA intercalators. Cells with near-zero DNA signal are empty masks or background.
Sum of all channel intensities. Extreme values = ablation artifacts or staining clumps not caught by the area filter.
# ── Compute QC metrics ─────────────────────────────────────────── adata.obs["n_counts"] = adata.X.sum(axis=1) adata.obs["area"] = meta["Area"] # ── Visualize distributions before filtering ───────────────────── sc.pl.violin(adata, ["area", "n_counts"], jitter=0.4, multi_panel=True) # ── Filter ─────────────────────────────────────────────────────── adata = adata[ (adata.obs["area"] > 10) & (adata.obs["area"] < 500) ].copy() print(f"After QC: {adata.n_obs} cells")
The Jackson dataset is already reasonably clean. For your own IMC data, plot area and signal distributions before setting any thresholds. Different tissues and staining protocols produce very different quality profiles.
Normalization
IMC data is not normalized like RNA-seq. The standard is arcsinh transformation with a cofactor (typically 1 for IMC, 5 for CyTOF). This compresses the dynamic range while preserving the zero-inflation structure of protein data.
# ── arcsinh transformation (cofactor = 1 for IMC) ──────────────── # cofactor = 5 for CyTOF/flow, 1 for IMC (lower signal range) cofactor = 1 adata.X = np.arcsinh(adata.X / cofactor) # ── Store raw for reference ────────────────────────────────────── adata.layers["raw"] = adata.X.copy() # ── Subset to cell-type markers (exclude segmentation channels) ─── # Remove DNA1, DNA2, HistoneH3 — used for segmentation, not biology marker_cols = [c for c in adata.var_names if c not in ["DNA1", "DNA2", "HistoneH3"]] adata_markers = adata[:, marker_cols].copy()
Log1p is designed for count data (integers, many zeros). IMC intensities are continuous non-negative real numbers. Arcsinh is the standard for flow and mass cytometry — it behaves like log for large values but is linear near zero, which avoids distorting the heavily zero-inflated distribution.
Dimensionality Reduction & Clustering
With only 38 proteins (vs 20,000 genes in scRNA-seq), PCA is fast and you need far fewer components. The rest of the pipeline — neighbor graph, UMAP, Leiden — is identical to scRNA-seq.
# ── PCA ────────────────────────────────────────────────────────── sc.tl.pca(adata_markers, n_comps=15) # fewer than scRNA (38 proteins) # ── Neighbor graph ─────────────────────────────────────────────── sc.pp.neighbors(adata_markers, n_neighbors=10, n_pcs=15) # ── UMAP ───────────────────────────────────────────────────────── sc.tl.umap(adata_markers, random_state=42) # ── Leiden clustering ──────────────────────────────────────────── sc.tl.leiden(adata_markers, resolution=0.5, random_state=0) # ── Visualize ──────────────────────────────────────────────────── sc.pl.umap(adata_markers, color=["leiden", "CK5", "CD45"], ncols=3)
Cell Type Annotation
With a predefined protein panel, annotation is more manual than scRNA-seq. You know exactly which proteins distinguish each cell type — there are no surprises from transcriptome-wide discovery.
# ── Dotplot: defining markers per cluster ──────────────────────── marker_dict = { "Epithelial": ["CK5", "CK8", "CK18", "EpCAM"], "Tumor": ["HER2", "ER", "Ki67", "p53"], "T cells": ["CD3", "CD4", "CD8a"], "Macrophages": ["CD68", "CD163"], "B cells": ["CD20"], "Stromal": ["Vimentin", "SMA", "FAP"], } sc.pl.dotplot(adata_markers, marker_dict, groupby="leiden", standard_scale="var") # ── Assign cell type labels ─────────────────────────────────────── cell_type_map = { "0": "Tumor", "1": "CD8 T", "2": "Stromal", "3": "CD4 T", "4": "Macrophage", "5": "Epithelial", "6": "B cell", "7": "Endothelial", } adata_markers.obs["cell_type"] = ( adata_markers.obs["leiden"].map(cell_type_map) ) sc.pl.umap(adata_markers, color="cell_type", legend_loc="on data")
A cluster expressing both CK5 and CD45 is likely a segmentation error — a tumor cell mask bleeding into a nearby immune cell — rather than a real biology. Always sanity-check clusters that co-express markers from different lineages and consider removing or re-segmenting them.
Spatial Analysis with Squidpy
This is the part that makes IMC unique. With X/Y coordinates for every cell you can ask: which cell types cluster together in tissue? Which pairs are spatially co-enriched beyond what random placement would predict?
# ── Build spatial neighborhood graph ───────────────────────────── sq.gr.spatial_neighbors(adata_markers, coord_type="generic", delaunay=True) # ── Neighborhood enrichment ─────────────────────────────────────── # Which cell type pairs are spatially co-enriched beyond chance? sq.gr.nhood_enrichment(adata_markers, cluster_key="cell_type") sq.pl.nhood_enrichment(adata_markers, cluster_key="cell_type", method="average", cmap="RdBu_r") # ── Co-occurrence score ─────────────────────────────────────────── # How does spatial enrichment change as you increase radius? sq.gr.co_occurrence(adata_markers, cluster_key="cell_type") sq.pl.co_occurrence(adata_markers, cluster_key="cell_type", clusters="Tumor") # ── Ripley's L: is a cell type spatially self-clustered? ────────── sq.gr.ripley(adata_markers, cluster_key="cell_type", mode="L") sq.pl.ripley(adata_markers, cluster_key="cell_type", mode="L")
Positive = two cell types are neighbors more often than expected by chance. Zero = random. Negative = spatial avoidance. In breast cancer you typically see strong Tumor–Stromal co-enrichment and variable Tumor–T cell proximity depending on immune infiltration pattern.
Cell–Cell Interaction Analysis
Squidpy's interaction matrix counts how often each pair of cell types are direct neighbors in the spatial graph — a more stringent proximity test than neighborhood enrichment.
# ── Interaction matrix ──────────────────────────────────────────── # Counts direct neighbor pairs in the Delaunay graph sq.gr.interaction_matrix(adata_markers, cluster_key="cell_type") sq.pl.interaction_matrix(adata_markers, cluster_key="cell_type") # ── Centrality scores ───────────────────────────────────────────── # Which cell types are most "central" in the spatial network? sq.gr.centrality_scores(adata_markers, cluster_key="cell_type") sq.pl.centrality_scores(adata_markers, cluster_key="cell_type") # ── Combine with L-R analysis (optional) ──────────────────────── # Pass spatial neighbor graph to CellChat or NicheNet as # an adjacency filter — only infer interactions between # spatially adjacent cells rather than all cell type pairs
Just because two cell types are neighbors in the tissue does not mean they're signaling to each other. Proximity is necessary but not sufficient for interaction. Use spatial adjacency to filter and prioritize L-R predictions, but always validate with functional data.
Publication Figures
A typical spatial proteomics paper figure set includes these panels:
Always include a spatial map figure — cell type annotations overlaid on actual tissue coordinates for at least one representative ROI. This is the visual that proves your analysis is spatial and not just standard scRNA-seq on protein data. Show a high-TME and low-TME patient side by side if possible — the difference in spatial organization is striking and immediately convincing.