Section 01

What is Bulk RNA-seq?

Bulk RNA-seq measures the average gene expression across all cells in a sample simultaneously. You extract all the RNA from a tissue or cell culture, sequence it, and count how many reads map to each gene — producing one expression profile per sample.

Bulk RNA-seq is like recording the total noise level in a crowd — one number reflecting everyone together. scRNA-seq is like recording each person's voice individually. Bulk is faster, cheaper, and statistically more powerful for condition comparisons. Single-cell resolves cell-type heterogeneity that bulk misses entirely.
Tissue / Cells (two conditions) treated vs untreated · tumor vs normal · KO vs WT RNA extraction + library prep polyA selection · cDNA synthesis · adapter ligation Illumina sequencing → FASTQ paired-end reads · R1 + R2 per sample STAR alignment → BAM splice-aware alignment to reference genome featureCounts → count matrix genes × samples integer counts · DESeq2 input DESeq2 Analysis in R
Figure 1 — Complete bulk RNA-seq workflow. Upstream steps produce a genes × samples count matrix; DESeq2 handles everything from there.

Section 02

Bulk vs scRNA-seq

The two technologies answer different questions. Choose based on your biological question, not what's available.

FeatureBulk RNA-seqscRNA-seq
ResolutionTissue averageSingle cell
Statistical powerHigh (more reads/gene)Lower (sparse)
Cost per sampleLow (~$50–200)Higher (~$500–2000)
Cell type resolutionNone — all cells mixedFull breakdown
DEG analysisGold standard (DESeq2)Pseudobulk recommended
Best forCondition comparisonsCell type discovery
💡
Pseudobulk DEG — best of both worlds

If you have scRNA-seq data and want rigorous DEG, aggregate counts per cell type per donor (pseudobulk), then run DESeq2 on the result. Cell-type-specific DEGs with proper statistical modeling — current best practice.


Section 03

The Airway Dataset

Airway Smooth Muscle Cells + Dexamethasone
Himes et al. 2014 · PLOS Genetics · GSE52778 · Bioconductor airway package

Human airway smooth muscle cells from 4 donors treated with dexamethasone — a synthetic glucocorticoid with anti-inflammatory effects used in asthma treatment. Classic 2-condition paired design. Loads in one line of R.

8
Samples total
4
Cell lines
63,677
Genes
2
Conditions
→ Bioconductor: airway package
🔬
Understanding the paired design

4 donors × 2 conditions = 8 samples. Each donor contributes one treated and one untreated sample. Pairing removes donor-to-donor variation from the error term, giving substantially more statistical power. Modeled in DESeq2 as ~ cell + dex.

DONOR N61311 N052611 N080611 N61011 UNTREATED SRR1039508 SRR1039512 SRR1039516 SRR1039520 dex TREATED (DEX) SRR1039509 SRR1039513 SRR1039517 SRR1039521 pair 1
Figure 2 — Paired design. Each donor contributes one untreated and one dex-treated sample. Modeled as ~ cell + dex in DESeq2.

Section 04

Upstream: FASTQ → Count Matrix

The airway package provides pre-computed counts. For real data, this upstream processing runs on a compute cluster.

🔍
FastQC / MultiQC cluster

Check per-base quality, adapters, GC content. Trim with fastp if quality drops at read ends.

🗺️
STAR alignment → BAM cluster

Splice-aware alignment to reference genome. Expect >80% alignment rate for good data. Output: sorted BAM files per sample.

🔢
featureCounts → count matrix cluster

Count reads per gene using GTF annotation. Output: genes × samples integer matrix. Always use raw counts as DESeq2 input — never normalize first.

nf-core/rnaseq — recommended recommended

Nextflow pipeline wrapping all above steps plus MultiQC. Fully reproducible, one command, outputs count matrix ready for DESeq2.


Section 05

What DESeq2 Actually Does

DESeq2 is not a t-test on counts. Its statistical model explains why it works better — especially with small sample sizes (n=3).

RNA-seq counts follow a negative binomial distribution — like Poisson but with overdispersion. DESeq2 models this explicitly: estimates dispersion per gene, shrinks toward a fitted trend, fits a GLM, tests with Wald test. The result: accurate p-values even with n=3.
Size factors median-of-ratios normalize depth Dispersion gene-wise estimate shrink to trend GLM fitting negative binomial log2 fold changes Wald test p-values per gene BH → FDR Why not a t-test? Counts ≠ normally distributed · variance depends on mean n=3 too small for standard tests · DESeq2 borrows across genes
Figure 3 — DESeq2's four-step model. Each step borrows information across all genes (empirical Bayes) — critical for reliable results with small n.

Three concepts to understand

Size factors
≠ TPM or FPKM

Median-of-ratios normalization. Always provide raw integer counts to DESeq2. Never pre-normalize.

Dispersion
gene-specific variance

Each gene has its own variance. DESeq2 shrinks per-gene estimates toward a trend — what makes n=3 work.

Adjusted p-value
FDR, not raw p

33k tests → ~1650 false positives at p<0.05 by chance. Always filter on Benjamini-Hochberg padj.


Section 06

Load the Data

R01_load.R
# Install (run once)
BiocManager::install(c("DESeq2", "airway", "apeglm",
                       "org.Hs.eg.db", "clusterProfiler",
                       "EnhancedVolcano", "pheatmap"))

library(DESeq2); library(airway)
library(ggplot2); library(pheatmap)

# ── Load airway SummarizedExperiment ────────────────────────────
data(airway)
se <- airway

dim(assay(se))   # 63677 × 8
colData(se)       # cell + dex per sample

# ── Relevel: untreated as reference ─────────────────────────────
se$dex <- relevel(se$dex, ref = "untrt")
# Ensures positive LFC = higher in treated
⚠️
Always relevel your reference condition

DESeq2 uses the first factor level as the denominator. Without relevel(), LFCs may be reversed. Always explicitly set your control as the reference level.


Section 07

Exploratory Data Analysis

Always explore data visually before DEG testing. EDA reveals outliers, batch effects, and confirms groups separate as expected.

R02_eda.R
# ── Build DESeqDataSet ───────────────────────────────────────────
dds <- DESeqDataSet(se, design = ~ cell + dex)

# ── Pre-filter low-count genes ───────────────────────────────────
dds <- dds[rowSums(counts(dds)) >= 10, ]

# ── rlog transform for visualization ────────────────────────────
# rlog better for small n; VST faster for large n
rld <- rlog(dds, blind = TRUE)

# ── PCA ──────────────────────────────────────────────────────────
plotPCA(rld, intgroup = c("dex", "cell"))

# ── Sample distance heatmap ──────────────────────────────────────
sampleDists <- dist(t(assay(rld)))
pheatmap(as.matrix(sampleDists),
  clustering_distance_rows = sampleDists,
  clustering_distance_cols = sampleDists,
  annotation_col = as.data.frame(colData(rld)[, c("dex", "cell")])
)
ℹ️
rlog vs VST vs raw counts

Raw counts — DESeq2 input only, never visualize. rlog — regularized log, best for small n (<30 samples), use for PCA and heatmaps here. VST — faster, better for large n. Both are visualization-only — DESeq2 normalizes raw counts internally.


Section 08

Run DESeq2

R03_deseq.R
# ── Run (wraps size factors + dispersion + GLM) ──────────────────
dds <- DESeq(dds)

# ── Check dispersion plot — key QC step ─────────────────────────
plotDispEsts(dds)
# Gene-wise (black) scattered around trend (red)
# Final shrunken estimates (blue) pulled toward trend
# No trend visible → data quality issue

# ── Check size factors ───────────────────────────────────────────
sizeFactors(dds)
# Values near 1.0 = similar library sizes (good)

Section 09

Extract & Filter Results

R04_results.R
library(org.Hs.eg.db); library(AnnotationDbi)

res <- results(dds,
  contrast = c("dex", "trt", "untrt"),
  alpha = 0.05
)
summary(res)

# ── Add gene symbols ─────────────────────────────────────────────
res$symbol <- mapIds(org.Hs.eg.db,
  keys = rownames(res), keytype = "ENSEMBL",
  column = "SYMBOL", multiVals = "first")

# ── Filter significant DEGs: FDR + fold change ───────────────────
sig <- res[!is.na(res$padj) &
           res$padj < 0.05 &
           abs(res$log2FoldChange) > 1, ]

write.csv(as.data.frame(res[order(res$padj),]),
          "results/deseq2_results.csv")
⚠️
Always use padj — and report both thresholds

Raw p<0.05 gives ~1,650 false positives with 33k tests. Filter on padj < 0.05. Also apply |LFC| > 1 to remove statistically significant but biologically trivial changes. Report both in your methods.


Section 10

LFC Shrinkage

Raw fold changes are noisy for low-count genes. LFC shrinkage pulls unreliable estimates toward zero while leaving high-count genes minimally affected.

R05_shrinkage.R
resultsNames(dds)
# "dex_trt_vs_untrt" ← the coefficient we want

res_shrunk <- lfcShrink(dds,
  coef = "dex_trt_vs_untrt",
  type = "apeglm"   # recommended method
)

# Compare MA plots before vs after
par(mfrow = c(1, 2))
plotMA(res,        main = "Raw LFC",     ylim = c(-5, 5))
plotMA(res_shrunk, main = "Shrunken LFC", ylim = c(-5, 5))
💡
Use shrunken LFCs for ranking and visualization

Use res_shrunk for volcano plots, heatmaps, and ranking by effect size. Use original res padj for significance filtering. Shrinkage doesn't change which genes are significant — only stabilizes fold change magnitudes.


Section 11

Visualization

R06_viz.R
library(EnhancedVolcano)

# ── Volcano plot ─────────────────────────────────────────────────
EnhancedVolcano(res_shrunk,
  lab = res_shrunk$symbol,
  x = "log2FoldChange", y = "padj",
  pCutoff = 0.05, FCcutoff = 1,
  title = "Dexamethasone vs Untreated"
)

# ── Top 50 DEG heatmap ───────────────────────────────────────────
top50 <- head(order(res$padj), 50)
mat   <- assay(rld)[top50, ] - rowMeans(assay(rld)[top50, ])
pheatmap(mat,
  annotation_col = as.data.frame(colData(rld)[,c("dex","cell")]),
  show_rownames = FALSE,
  color = colorRampPalette(c("#1e3a5f","white","#92400e"))(100)
)

# ── Single gene: DUSP1 (top dex-regulated) ───────────────────────
plotCounts(dds, gene = which(res$symbol == "DUSP1"),
  intgroup = c("dex", "cell"), returnData = TRUE) |>
  ggplot(aes(dex, count, color = cell, group = cell)) +
  geom_point(size = 3) + geom_line() + scale_y_log10()

Section 12

GO & GSEA Enrichment

A DEG list tells you which genes change. Enrichment analysis tells you what biology those genes represent.

R07_enrichment.R
library(clusterProfiler); library(org.Hs.eg.db)

# ── ENSEMBL → ENTREZ ─────────────────────────────────────────────
entrez <- bitr(rownames(sig), fromType = "ENSEMBL",
               toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# ── ORA: GO biological process ───────────────────────────────────
ego <- enrichGO(gene = entrez$ENTREZID,
  OrgDb = org.Hs.eg.db, ont = "BP",
  pAdjustMethod = "BH", pvalueCutoff = 0.05, readable = TRUE)
dotplot(ego, showCategory = 20)

# ── GSEA: ranked list (preferred for publications) ───────────────
gene_list <- sort(
  setNames(res_shrunk$log2FoldChange,
           mapIds(org.Hs.eg.db, rownames(res_shrunk),
                  "ENTREZID", "ENSEMBL")),
  decreasing = TRUE)
gene_list <- gene_list[!is.na(names(gene_list))]

gsea_res <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
  ont = "BP", pvalueCutoff = 0.05, seed = TRUE)
gseaplot2(gsea_res, geneSetID = 1)

Section 13

Complex Designs

R08_designs.R
# Simple (unpaired)
DESeqDataSet(se, design = ~ dex)

# Paired — our airway design (controls for donor)
DESeqDataSet(se, design = ~ cell + dex)

# Interaction: does dex effect differ between cell lines?
dds_int <- DESeqDataSet(se, design = ~ cell * dex)
dds_int <- DESeq(dds_int)
results(dds_int, name = "cellN61311.dextrt")

# Batch correction in model formula (better than removing from counts)
# DESeqDataSet(se, design = ~ batch + condition)
💡
Model everything you can measure

Adding cell to the airway design removed donor variance from the error term, dramatically increasing power. Any known confounder — donor, batch, sex, plate — belongs in the design formula, not corrected from counts separately.


Watch Out

Common Pitfalls

These appear in published papers. Knowing them makes you a better analyst and reviewer.

Pre-normalizing counts
never input TPM or RPKM

DESeq2 requires raw integer counts. Normalized input breaks the statistical model entirely.

Using pvalue not padj
always use BH-adjusted

~1,650 false positives at p<0.05 with 33k tests. Always filter on padj (FDR).

Wrong reference level
always relevel()

Forgetting relevel() gives LFC in the wrong direction. Check: positive = higher in treatment.

n=2 per condition
minimum n=3 biological

DESeq2 needs replicates to estimate dispersion. With n=2, results are unreliable.

Ignoring batch effects
model in design formula

Uncorrected batch inflates false positives. Use ~ batch + condition, not correction on counts.

No fold change threshold
apply |LFC| > 1

Significant p-value alone ≠ biologically meaningful. Apply 2-fold threshold alongside padj.


Section 14

Publication Figures

🌋
Volcano
LFC vs -log10(padj)
🔵
PCA
Samples by condition
🔥
Heatmap
Top 50 DEGs
📊
MA plot
Before/after shrink
🟡
GO dotplot
Enriched processes
📈
GSEA plot
Running score
📉
Dispersion
QC: per-gene
🎻
Gene counts
Per sample counts
🎯
How this connects to the tutorial series

Bulk RNA-seq is the foundation. scRNA-seq pseudobulk uses the same DESeq2 model per cell type. CITE-seq adds protein-level DEG alongside RNA. Spatial transcriptomics uses pseudobulk DESeq2 per spatial domain. The negative binomial model and size factors reappear across all of them.