Starlitnightly / bulk-rna-seq-differential-expression-with-omicverse

Guide Claude through omicverse's bulk RNA-seq DEG pipeline, from gene ID mapping and DESeq2 normalization to statistical testing, visualization, and pathway enrichment. Use when a user has bulk count matrices and needs differential expression analysis in omicverse.

0 views
0 installs

Skill Content

---
name: bulk-rna-seq-differential-expression-with-omicverse
title: Bulk RNA-seq differential expression with omicverse
description: "Bulk RNA-seq DEG pipeline: gene ID mapping, DESeq2 normalization, statistical testing, volcano plots, and pathway enrichment in OmicVerse."
---

# Bulk RNA-seq differential expression with omicverse

## Overview
Follow this skill to run the end-to-end differential expression (DEG) workflow showcased in [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb). It assumes the user provides a raw gene-level count matrix (e.g., from featureCounts) and wants to analyse bulk RNA-seq cohorts inside omicverse.

## Instructions
1. **Set up the session**
   - Import `omicverse as ov`, `scanpy as sc`, and `matplotlib.pyplot as plt`.
   - Call `ov.plot_set()` so downstream plots adopt omicverse styling.
2. **Prepare ID mapping assets**
   - When gene IDs must be converted to gene symbols, instruct the user to download mapping pairs via `ov.utils.download_geneid_annotation_pair()` and store them under `genesets/`.
   - Mention the available prebuilt genomes (T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, danRer11) and that users can generate their own mapping from GTF files if needed.
3. **Load the raw counts**
   - Read tab-delimited featureCounts output with `ov.pd.read_csv(..., sep='\t', header=1, index_col=0)`.
   - Strip trailing `.bam` segments from column names using list comprehension so sample IDs are clean.
4. **Map gene identifiers**
   - Run `ov.bulk.Matrix_ID_mapping(counts_df, 'genesets/pair_<GENOME>.tsv')` to replace `gene_id` entries with gene symbols.
5. **Initialise the DEG object**
   - Create `dds = ov.bulk.pyDEG(mapped_counts)`.
   - Handle duplicate gene symbols with `dds.drop_duplicates_index()` to keep the highest expressed version.
6. **Normalise and estimate size factors**
   - Execute `dds.normalize()` to calculate DESeq2 size factors, correcting for library size and batch differences.
7. **Run differential testing**
   - Collect treatment and control replicate labels into lists.
   - Call `dds.deg_analysis(treatment_groups, control_groups, method='ttest')` for the default Welch t-test.
   - Offer optional alternatives: `method='edgepy'` for edgeR-like tests and `method='limma'` for limma-style modelling.
8. **Filter and threshold results**
   - Note that lowly expressed genes are retained by default; filter using `dds.result.loc[dds.result['log2(BaseMean)'] > 1]` when needed.
   - Set dynamic fold-change and significance cutoffs via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` (`fc_threshold=-1` auto-selects based on log2FC distribution).
9. **Visualise differential expression**
   - Produce volcano plots with `dds.plot_volcano(title=..., figsize=..., plot_genes=... or plot_genes_num=...)` to highlight key genes.
   - Generate per-gene boxplots using `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=..., legend_bbox=...)`; adjust y-axis tick labels if required.
10. **Perform pathway enrichment (optional)**
    - Download curated pathway libraries through `ov.utils.download_pathway_database()`.
    - Load genesets with `ov.utils.geneset_prepare(<path>, organism='Mouse'|'Human'|...)`.
    - Build the DEG gene list from `dds.result.loc[dds.result['sig'] != 'normal'].index`.
    - Run enrichment with `ov.bulk.geneset_enrichment(gene_list=deg_genes, pathways_dict=..., pvalue_type='auto', organism=...)`. Encourage users without internet access to provide a `background` gene list.
    - Visualise single-library results via `ov.bulk.geneset_plot(...)` and combine multiple ontologies using `ov.bulk.geneset_plot_multi(enr_dict, colors_dict, num=...)`.
11. **Document outputs**
    - Suggest exporting `dds.result` and enrichment tables to CSV for downstream reporting.
    - Encourage users to save figures generated by matplotlib (`plt.savefig(...)`) when running outside notebooks.
12. **Defensive validation**
    ```python
    # Before DEG: verify treatment/control groups exist as column names
    all_cols = set(dds.result.columns) if hasattr(dds, 'result') else set(counts_df.columns)
    for g in treatment_groups + control_groups:
        assert g in all_cols, f"Sample '{g}' not found in count matrix columns"
    # Verify groups don't overlap
    assert not set(treatment_groups) & set(control_groups), "Treatment and control groups must not overlap"
    ```
13. **Troubleshooting tips**
    - Ensure sample labels in `treatment_groups`/`control_groups` exactly match column names post-cleanup.
    - Verify required packages (`omicverse`, `pyComplexHeatmap`, `gseapy`) are installed for enrichment visualisations.
    - Remind users that internet access is required the first time they download gene mappings or pathway databases.

## Examples
- "I have a featureCounts matrix for mouse tumour samples—normalize it with DESeq2, run t-test DEG, and highlight the top 8 genes in a volcano plot."
- "Use omicverse to compute edgeR-style differential expression between treated and control replicates, then run GO enrichment on significant genes."
- "Guide me through converting Ensembl IDs to symbols, performing limma DEG, and plotting boxplots for Krtap9-5 and Lef1."

## References
- Detailed walkthrough notebook: [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb)
- Sample count matrix for testing: [`sample/counts.txt`](../../sample/counts.txt)
- Quick copy/paste commands: [`reference.md`](reference.md)