with the Airway Dataset
A complete guide to bulk RNA-seq analysis — from raw counts to publication figures and pathway enrichment. Based on the canonical airway dataset: human airway smooth muscle cells treated with dexamethasone.
What is Bulk RNA-seq?
Bulk RNA-seq measures the average gene expression across all cells in a sample simultaneously. You extract all the RNA from a tissue or cell culture, sequence it, and count how many reads map to each gene — producing one expression profile per sample.
Bulk vs scRNA-seq
The two technologies answer different questions. Choose based on your biological question, not what's available.
| Feature | Bulk RNA-seq | scRNA-seq |
|---|---|---|
| Resolution | Tissue average | Single cell |
| Statistical power | High (more reads/gene) | Lower (sparse) |
| Cost per sample | Low (~$50–200) | Higher (~$500–2000) |
| Cell type resolution | None — all cells mixed | Full breakdown |
| DEG analysis | Gold standard (DESeq2) | Pseudobulk recommended |
| Best for | Condition comparisons | Cell type discovery |
If you have scRNA-seq data and want rigorous DEG, aggregate counts per cell type per donor (pseudobulk), then run DESeq2 on the result. Cell-type-specific DEGs with proper statistical modeling — current best practice.
The Airway Dataset
Human airway smooth muscle cells from 4 donors treated with dexamethasone — a synthetic glucocorticoid with anti-inflammatory effects used in asthma treatment. Classic 2-condition paired design. Loads in one line of R.
4 donors × 2 conditions = 8 samples. Each donor contributes one treated and one untreated sample. Pairing removes donor-to-donor variation from the error term, giving substantially more statistical power. Modeled in DESeq2 as ~ cell + dex.
~ cell + dex in DESeq2.Upstream: FASTQ → Count Matrix
The airway package provides pre-computed counts. For real data, this upstream processing runs on a compute cluster.
Check per-base quality, adapters, GC content. Trim with fastp if quality drops at read ends.
Splice-aware alignment to reference genome. Expect >80% alignment rate for good data. Output: sorted BAM files per sample.
Count reads per gene using GTF annotation. Output: genes × samples integer matrix. Always use raw counts as DESeq2 input — never normalize first.
Nextflow pipeline wrapping all above steps plus MultiQC. Fully reproducible, one command, outputs count matrix ready for DESeq2.
What DESeq2 Actually Does
DESeq2 is not a t-test on counts. Its statistical model explains why it works better — especially with small sample sizes (n=3).
Three concepts to understand
Median-of-ratios normalization. Always provide raw integer counts to DESeq2. Never pre-normalize.
Each gene has its own variance. DESeq2 shrinks per-gene estimates toward a trend — what makes n=3 work.
33k tests → ~1650 false positives at p<0.05 by chance. Always filter on Benjamini-Hochberg padj.
Load the Data
# Install (run once) BiocManager::install(c("DESeq2", "airway", "apeglm", "org.Hs.eg.db", "clusterProfiler", "EnhancedVolcano", "pheatmap")) library(DESeq2); library(airway) library(ggplot2); library(pheatmap) # ── Load airway SummarizedExperiment ──────────────────────────── data(airway) se <- airway dim(assay(se)) # 63677 × 8 colData(se) # cell + dex per sample # ── Relevel: untreated as reference ───────────────────────────── se$dex <- relevel(se$dex, ref = "untrt") # Ensures positive LFC = higher in treated
DESeq2 uses the first factor level as the denominator. Without relevel(), LFCs may be reversed. Always explicitly set your control as the reference level.
Exploratory Data Analysis
Always explore data visually before DEG testing. EDA reveals outliers, batch effects, and confirms groups separate as expected.
# ── Build DESeqDataSet ─────────────────────────────────────────── dds <- DESeqDataSet(se, design = ~ cell + dex) # ── Pre-filter low-count genes ─────────────────────────────────── dds <- dds[rowSums(counts(dds)) >= 10, ] # ── rlog transform for visualization ──────────────────────────── # rlog better for small n; VST faster for large n rld <- rlog(dds, blind = TRUE) # ── PCA ────────────────────────────────────────────────────────── plotPCA(rld, intgroup = c("dex", "cell")) # ── Sample distance heatmap ────────────────────────────────────── sampleDists <- dist(t(assay(rld))) pheatmap(as.matrix(sampleDists), clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, annotation_col = as.data.frame(colData(rld)[, c("dex", "cell")]) )
Raw counts — DESeq2 input only, never visualize. rlog — regularized log, best for small n (<30 samples), use for PCA and heatmaps here. VST — faster, better for large n. Both are visualization-only — DESeq2 normalizes raw counts internally.
Run DESeq2
# ── Run (wraps size factors + dispersion + GLM) ────────────────── dds <- DESeq(dds) # ── Check dispersion plot — key QC step ───────────────────────── plotDispEsts(dds) # Gene-wise (black) scattered around trend (red) # Final shrunken estimates (blue) pulled toward trend # No trend visible → data quality issue # ── Check size factors ─────────────────────────────────────────── sizeFactors(dds) # Values near 1.0 = similar library sizes (good)
Extract & Filter Results
library(org.Hs.eg.db); library(AnnotationDbi) res <- results(dds, contrast = c("dex", "trt", "untrt"), alpha = 0.05 ) summary(res) # ── Add gene symbols ───────────────────────────────────────────── res$symbol <- mapIds(org.Hs.eg.db, keys = rownames(res), keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first") # ── Filter significant DEGs: FDR + fold change ─────────────────── sig <- res[!is.na(res$padj) & res$padj < 0.05 & abs(res$log2FoldChange) > 1, ] write.csv(as.data.frame(res[order(res$padj),]), "results/deseq2_results.csv")
Raw p<0.05 gives ~1,650 false positives with 33k tests. Filter on padj < 0.05. Also apply |LFC| > 1 to remove statistically significant but biologically trivial changes. Report both in your methods.
LFC Shrinkage
Raw fold changes are noisy for low-count genes. LFC shrinkage pulls unreliable estimates toward zero while leaving high-count genes minimally affected.
resultsNames(dds) # "dex_trt_vs_untrt" ← the coefficient we want res_shrunk <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm" # recommended method ) # Compare MA plots before vs after par(mfrow = c(1, 2)) plotMA(res, main = "Raw LFC", ylim = c(-5, 5)) plotMA(res_shrunk, main = "Shrunken LFC", ylim = c(-5, 5))
Use res_shrunk for volcano plots, heatmaps, and ranking by effect size. Use original res padj for significance filtering. Shrinkage doesn't change which genes are significant — only stabilizes fold change magnitudes.
Visualization
library(EnhancedVolcano) # ── Volcano plot ───────────────────────────────────────────────── EnhancedVolcano(res_shrunk, lab = res_shrunk$symbol, x = "log2FoldChange", y = "padj", pCutoff = 0.05, FCcutoff = 1, title = "Dexamethasone vs Untreated" ) # ── Top 50 DEG heatmap ─────────────────────────────────────────── top50 <- head(order(res$padj), 50) mat <- assay(rld)[top50, ] - rowMeans(assay(rld)[top50, ]) pheatmap(mat, annotation_col = as.data.frame(colData(rld)[,c("dex","cell")]), show_rownames = FALSE, color = colorRampPalette(c("#1e3a5f","white","#92400e"))(100) ) # ── Single gene: DUSP1 (top dex-regulated) ─────────────────────── plotCounts(dds, gene = which(res$symbol == "DUSP1"), intgroup = c("dex", "cell"), returnData = TRUE) |> ggplot(aes(dex, count, color = cell, group = cell)) + geom_point(size = 3) + geom_line() + scale_y_log10()
GO & GSEA Enrichment
A DEG list tells you which genes change. Enrichment analysis tells you what biology those genes represent.
library(clusterProfiler); library(org.Hs.eg.db) # ── ENSEMBL → ENTREZ ───────────────────────────────────────────── entrez <- bitr(rownames(sig), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org.Hs.eg.db) # ── ORA: GO biological process ─────────────────────────────────── ego <- enrichGO(gene = entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05, readable = TRUE) dotplot(ego, showCategory = 20) # ── GSEA: ranked list (preferred for publications) ─────────────── gene_list <- sort( setNames(res_shrunk$log2FoldChange, mapIds(org.Hs.eg.db, rownames(res_shrunk), "ENTREZID", "ENSEMBL")), decreasing = TRUE) gene_list <- gene_list[!is.na(names(gene_list))] gsea_res <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db, ont = "BP", pvalueCutoff = 0.05, seed = TRUE) gseaplot2(gsea_res, geneSetID = 1)
Complex Designs
# Simple (unpaired) DESeqDataSet(se, design = ~ dex) # Paired — our airway design (controls for donor) DESeqDataSet(se, design = ~ cell + dex) # Interaction: does dex effect differ between cell lines? dds_int <- DESeqDataSet(se, design = ~ cell * dex) dds_int <- DESeq(dds_int) results(dds_int, name = "cellN61311.dextrt") # Batch correction in model formula (better than removing from counts) # DESeqDataSet(se, design = ~ batch + condition)
Adding cell to the airway design removed donor variance from the error term, dramatically increasing power. Any known confounder — donor, batch, sex, plate — belongs in the design formula, not corrected from counts separately.
Common Pitfalls
These appear in published papers. Knowing them makes you a better analyst and reviewer.
DESeq2 requires raw integer counts. Normalized input breaks the statistical model entirely.
~1,650 false positives at p<0.05 with 33k tests. Always filter on padj (FDR).
Forgetting relevel() gives LFC in the wrong direction. Check: positive = higher in treatment.
DESeq2 needs replicates to estimate dispersion. With n=2, results are unreliable.
Uncorrected batch inflates false positives. Use ~ batch + condition, not correction on counts.
Significant p-value alone ≠ biologically meaningful. Apply 2-fold threshold alongside padj.
Publication Figures
Bulk RNA-seq is the foundation. scRNA-seq pseudobulk uses the same DESeq2 model per cell type. CITE-seq adds protein-level DEG alongside RNA. Spatial transcriptomics uses pseudobulk DESeq2 per spatial domain. The negative binomial model and size factors reappear across all of them.