Bulk RNA-seq Tutorial — Airway Dataset

Section 01

What is Bulk RNA-seq?

Bulk RNA-seq measures the average gene expression across all cells in a sample simultaneously. You extract all the RNA from a tissue or cell culture, sequence it, and count how many reads map to each gene — producing one expression profile per sample.

Bulk RNA-seq is like recording the total noise level in a crowd — one number reflecting everyone together. scRNA-seq is like recording each person's voice individually. Bulk is faster, cheaper, and statistically more powerful for condition comparisons. Single-cell resolves cell-type heterogeneity that bulk misses entirely.

Figure 1 — Complete bulk RNA-seq workflow. Upstream steps produce a genes × samples count matrix; DESeq2 handles everything from there.

Section 02

Bulk vs scRNA-seq

The two technologies answer different questions. Choose based on your biological question, not what's available.

Feature	Bulk RNA-seq	scRNA-seq
Resolution	Tissue average	Single cell
Statistical power	High (more reads/gene)	Lower (sparse)
Cost per sample	Low (~$50–200)	Higher (~$500–2000)
Cell type resolution	None — all cells mixed	Full breakdown
DEG analysis	Gold standard (DESeq2)	Pseudobulk recommended
Best for	Condition comparisons	Cell type discovery

💡

Pseudobulk DEG — best of both worlds

If you have scRNA-seq data and want rigorous DEG, aggregate counts per cell type per donor (pseudobulk), then run DESeq2 on the result. Cell-type-specific DEGs with proper statistical modeling — current best practice.

Section 03

The Airway Dataset

Airway Smooth Muscle Cells + Dexamethasone

Himes et al. 2014 · PLOS Genetics · GSE52778 · Bioconductor airway package

Human airway smooth muscle cells from 4 donors treated with dexamethasone — a synthetic glucocorticoid with anti-inflammatory effects used in asthma treatment. Classic 2-condition paired design. Loads in one line of R.

8

Samples total

4

Cell lines

63,677

Genes

2

Conditions

→ Bioconductor: airway package

🔬

Understanding the paired design

4 donors × 2 conditions = 8 samples. Each donor contributes one treated and one untreated sample. Pairing removes donor-to-donor variation from the error term, giving substantially more statistical power. Modeled in DESeq2 as ~ cell + dex.

Figure 2 — Paired design. Each donor contributes one untreated and one dex-treated sample. Modeled as ~ cell + dex in DESeq2.

Section 04

Upstream: FASTQ → Count Matrix

The airway package provides pre-computed counts. For real data, this upstream processing runs on a compute cluster.

🔍

FastQC / MultiQC cluster

Check per-base quality, adapters, GC content. Trim with fastp if quality drops at read ends.

🗺️

STAR alignment → BAM cluster

Splice-aware alignment to reference genome. Expect >80% alignment rate for good data. Output: sorted BAM files per sample.

🔢

featureCounts → count matrix cluster

Count reads per gene using GTF annotation. Output: genes × samples integer matrix. Always use raw counts as DESeq2 input — never normalize first.

⚡

nf-core/rnaseq — recommended recommended

Nextflow pipeline wrapping all above steps plus MultiQC. Fully reproducible, one command, outputs count matrix ready for DESeq2.

Section 05

What DESeq2 Actually Does

DESeq2 is not a t-test on counts. Its statistical model explains why it works better — especially with small sample sizes (n=3).

RNA-seq counts follow a negative binomial distribution — like Poisson but with overdispersion. DESeq2 models this explicitly: estimates dispersion per gene, shrinks toward a fitted trend, fits a GLM, tests with Wald test. The result: accurate p-values even with n=3.

Figure 3 — DESeq2's four-step model. Each step borrows information across all genes (empirical Bayes) — critical for reliable results with small n.

Three concepts to understand

Size factors

≠ TPM or FPKM

Median-of-ratios normalization. Always provide raw integer counts to DESeq2. Never pre-normalize.

Dispersion

gene-specific variance

Each gene has its own variance. DESeq2 shrinks per-gene estimates toward a trend — what makes n=3 work.

Adjusted p-value

FDR, not raw p

33k tests → ~1650 false positives at p<0.05 by chance. Always filter on Benjamini-Hochberg padj.

Section 06

Load the Data

R01_load.R

# Install (run once)
BiocManager::install(c("DESeq2", "airway", "apeglm",
                       "org.Hs.eg.db", "clusterProfiler",
                       "EnhancedVolcano", "pheatmap"))

library(DESeq2); library(airway)
library(ggplot2); library(pheatmap)

# ── Load airway SummarizedExperiment ────────────────────────────
data(airway)
se <- airway

dim(assay(se))   # 63677 × 8
colData(se)       # cell + dex per sample

# ── Relevel: untreated as reference ─────────────────────────────
se$dex <- relevel(se$dex, ref = "untrt")
# Ensures positive LFC = higher in treated

⚠️

Always relevel your reference condition

DESeq2 uses the first factor level as the denominator. Without relevel(), LFCs may be reversed. Always explicitly set your control as the reference level.

Section 07

Exploratory Data Analysis

Always explore data visually before DEG testing. EDA reveals outliers, batch effects, and confirms groups separate as expected.

R02_eda.R

# ── Build DESeqDataSet ───────────────────────────────────────────
dds <- DESeqDataSet(se, design = ~ cell + dex)

# ── Pre-filter low-count genes ───────────────────────────────────
dds <- dds[rowSums(counts(dds)) >= 10, ]

# ── rlog transform for visualization ────────────────────────────
# rlog better for small n; VST faster for large n
rld <- rlog(dds, blind = TRUE)

# ── PCA ──────────────────────────────────────────────────────────
plotPCA(rld, intgroup = c("dex", "cell"))

# ── Sample distance heatmap ──────────────────────────────────────
sampleDists <- dist(t(assay(rld)))
pheatmap(as.matrix(sampleDists),
  clustering_distance_rows = sampleDists,
  clustering_distance_cols = sampleDists,
  annotation_col = as.data.frame(colData(rld)[, c("dex", "cell")])
)

ℹ️

rlog vs VST vs raw counts

Raw counts — DESeq2 input only, never visualize. rlog — regularized log, best for small n (<30 samples), use for PCA and heatmaps here. VST — faster, better for large n. Both are visualization-only — DESeq2 normalizes raw counts internally.

Section 08

Run DESeq2

R03_deseq.R

# ── Run (wraps size factors + dispersion + GLM) ──────────────────
dds <- DESeq(dds)

# ── Check dispersion plot — key QC step ─────────────────────────
plotDispEsts(dds)
# Gene-wise (black) scattered around trend (red)
# Final shrunken estimates (blue) pulled toward trend
# No trend visible → data quality issue

# ── Check size factors ───────────────────────────────────────────
sizeFactors(dds)
# Values near 1.0 = similar library sizes (good)

Section 09

Extract & Filter Results

R04_results.R

library(org.Hs.eg.db); library(AnnotationDbi)

res <- results(dds,
  contrast = c("dex", "trt", "untrt"),
  alpha = 0.05
)
summary(res)

# ── Add gene symbols ─────────────────────────────────────────────
res$symbol <- mapIds(org.Hs.eg.db,
  keys = rownames(res), keytype = "ENSEMBL",
  column = "SYMBOL", multiVals = "first")

# ── Filter significant DEGs: FDR + fold change ───────────────────
sig <- res[!is.na(res$padj) &
           res$padj < 0.05 &
           abs(res$log2FoldChange) > 1, ]

write.csv(as.data.frame(res[order(res$padj),]),
          "results/deseq2_results.csv")

⚠️

Always use padj — and report both thresholds

Raw p<0.05 gives ~1,650 false positives with 33k tests. Filter on padj < 0.05. Also apply |LFC| > 1 to remove statistically significant but biologically trivial changes. Report both in your methods.

Section 10

LFC Shrinkage

Raw fold changes are noisy for low-count genes. LFC shrinkage pulls unreliable estimates toward zero while leaving high-count genes minimally affected.

R05_shrinkage.R

resultsNames(dds)
# "dex_trt_vs_untrt" ← the coefficient we want

res_shrunk <- lfcShrink(dds,
  coef = "dex_trt_vs_untrt",
  type = "apeglm"   # recommended method
)

# Compare MA plots before vs after
par(mfrow = c(1, 2))
plotMA(res,        main = "Raw LFC",     ylim = c(-5, 5))
plotMA(res_shrunk, main = "Shrunken LFC", ylim = c(-5, 5))

💡

Use shrunken LFCs for ranking and visualization

Use res_shrunk for volcano plots, heatmaps, and ranking by effect size. Use original res padj for significance filtering. Shrinkage doesn't change which genes are significant — only stabilizes fold change magnitudes.

Section 11

Visualization

R06_viz.R

library(EnhancedVolcano)

# ── Volcano plot ─────────────────────────────────────────────────
EnhancedVolcano(res_shrunk,
  lab = res_shrunk$symbol,
  x = "log2FoldChange", y = "padj",
  pCutoff = 0.05, FCcutoff = 1,
  title = "Dexamethasone vs Untreated"
)

# ── Top 50 DEG heatmap ───────────────────────────────────────────
top50 <- head(order(res$padj), 50)
mat   <- assay(rld)[top50, ] - rowMeans(assay(rld)[top50, ])
pheatmap(mat,
  annotation_col = as.data.frame(colData(rld)[,c("dex","cell")]),
  show_rownames = FALSE,
  color = colorRampPalette(c("#1e3a5f","white","#92400e"))(100)
)

# ── Single gene: DUSP1 (top dex-regulated) ───────────────────────
plotCounts(dds, gene = which(res$symbol == "DUSP1"),
  intgroup = c("dex", "cell"), returnData = TRUE) |>
  ggplot(aes(dex, count, color = cell, group = cell)) +
  geom_point(size = 3) + geom_line() + scale_y_log10()

Section 12

GO & GSEA Enrichment

A DEG list tells you which genes change. Enrichment analysis tells you what biology those genes represent.

R07_enrichment.R

library(clusterProfiler); library(org.Hs.eg.db)

# ── ENSEMBL → ENTREZ ─────────────────────────────────────────────
entrez <- bitr(rownames(sig), fromType = "ENSEMBL",
               toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# ── ORA: GO biological process ───────────────────────────────────
ego <- enrichGO(gene = entrez$ENTREZID,
  OrgDb = org.Hs.eg.db, ont = "BP",
  pAdjustMethod = "BH", pvalueCutoff = 0.05, readable = TRUE)
dotplot(ego, showCategory = 20)

# ── GSEA: ranked list (preferred for publications) ───────────────
gene_list <- sort(
  setNames(res_shrunk$log2FoldChange,
           mapIds(org.Hs.eg.db, rownames(res_shrunk),
                  "ENTREZID", "ENSEMBL")),
  decreasing = TRUE)
gene_list <- gene_list[!is.na(names(gene_list))]

gsea_res <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
  ont = "BP", pvalueCutoff = 0.05, seed = TRUE)
gseaplot2(gsea_res, geneSetID = 1)

Section 13

Complex Designs

R08_designs.R

# Simple (unpaired)
DESeqDataSet(se, design = ~ dex)

# Paired — our airway design (controls for donor)
DESeqDataSet(se, design = ~ cell + dex)

# Interaction: does dex effect differ between cell lines?
dds_int <- DESeqDataSet(se, design = ~ cell * dex)
dds_int <- DESeq(dds_int)
results(dds_int, name = "cellN61311.dextrt")

# Batch correction in model formula (better than removing from counts)
# DESeqDataSet(se, design = ~ batch + condition)

💡

Model everything you can measure

Adding cell to the airway design removed donor variance from the error term, dramatically increasing power. Any known confounder — donor, batch, sex, plate — belongs in the design formula, not corrected from counts separately.

Watch Out

Common Pitfalls

These appear in published papers. Knowing them makes you a better analyst and reviewer.

Pre-normalizing counts

never input TPM or RPKM

DESeq2 requires raw integer counts. Normalized input breaks the statistical model entirely.

Using pvalue not padj

always use BH-adjusted

~1,650 false positives at p<0.05 with 33k tests. Always filter on padj (FDR).

Wrong reference level

always relevel()

Forgetting relevel() gives LFC in the wrong direction. Check: positive = higher in treatment.

n=2 per condition

minimum n=3 biological

DESeq2 needs replicates to estimate dispersion. With n=2, results are unreliable.

Ignoring batch effects

model in design formula

Uncorrected batch inflates false positives. Use ~ batch + condition, not correction on counts.

No fold change threshold

apply |LFC| > 1

Significant p-value alone ≠ biologically meaningful. Apply 2-fold threshold alongside padj.

Section 14

Publication Figures

🌋

Volcano

LFC vs -log10(padj)

🔵

PCA

Samples by condition

🔥

Heatmap

Top 50 DEGs

📊

MA plot

Before/after shrink

🟡

GO dotplot

Enriched processes

📈

GSEA plot

Running score

📉

Dispersion

QC: per-gene

🎻

Gene counts

Per sample counts

🎯

How this connects to the tutorial series

Bulk RNA-seq is the foundation. scRNA-seq pseudobulk uses the same DESeq2 model per cell type. CITE-seq adds protein-level DEG alongside RNA. Spatial transcriptomics uses pseudobulk DESeq2 per spatial domain. The negative binomial model and size factors reappear across all of them.