# Analytical Plan — Step Reference
# tcm-biomedical-research-strategist

## 9-Field Template (apply to every step)

| Field | What to provide |
|---|---|
| **Step Name** | Short label |
| **Purpose** | What this step accomplishes in the pipeline |
| **Required Input** | Exact data / files / outputs from prior steps |
| **Proposed Method** | Tool / algorithm + *why this instead of alternatives* |
| **Key Parameters** | Thresholds, cutoffs, decision rules — be specific |
| **Expected Output** | File format + content description |
| **Connects To** | How this output feeds the next step |
| **Failure Points** | What could go wrong; how to detect it |
| **Alternative Method** | Backup tool/approach if primary fails |

---

## 14 Mandatory Steps

### Step 1 — Active Compound Collection & ADME Screening
- Primary source: TCMSP (OB ≥ 30%, DL ≥ 0.18); fallback: HERB, SymMap
- ⚠ **Multi-herb formulas**: deduplicate by InChIKey across all herbs; record herb-of-origin; prioritize compounds in ≥ 2 herbs as core candidates; document threshold conflicts
- ⚠ **Sparse data (<5 compounds)**: escalate to ChEMBL / UNPD / literature mining; relax OB to ≥ 20% with justification; flag downstream uncertainty

### Step 2 — Target Prediction for Screened Compounds
- Primary: SwissTargetPrediction (batch query); Secondary: SEA, SuperPred
- Filter by probability score (≥ 0.1 recommended); record source for each prediction

### Step 3 — Disease Target Collection
- Primary: GeneCards + DisGeNET (`disgenet2r` R package, CURATED evidence)
- Secondary: OMIM, TTD
- Use the disease name exactly as it appears in the database; retain evidence-level annotation

### Step 4 — Compound–Disease Target Intersection
- Tool: R `VennDiagram` or `UpSetR`; or manual `intersect()` on gene symbol lists
- Normalize all gene identifiers to HGNC official symbols before intersection
- Output: candidate target list + Venn/UpSet figure

### Step 5 — PPI Network Construction & Topology Analysis
- Tool: STRINGdb R package (`score_threshold = 400`); Cytoscape for visualization
- Topology metrics: degree, betweenness centrality, closeness centrality via igraph
- Hub definition: degree ≥ median + 1 SD or CytoHubba top-10 by MCC score
- Output: network .cys file + hub gene ranked table

### Step 6 — Transcriptomic Dataset Selection & QC/Preprocessing
- Source: GEO (≥ 30 samples/group preferred); TCGA-COAD/READ (CRC), TCGA-LIHC (HCC), TCGA-LUAD (NSCLC)
- Platform: Illumina RNA-seq (counts) or Affymetrix microarray (RMA-normalized)
- QC: PCA for batch effects; `arrayQualityMetrics` or `FastQC`; remove outlier samples
- Output: normalized expression matrix + QC report

### Step 7 — Differential Expression Analysis (DEG Identification)
- RNA-seq: DESeq2 (`|log2FC| > 1`, `padj < 0.05`)
- Microarray: limma (`|log2FC| > 0.5`, `adj.P.Val < 0.05`)
- Output: DEG table with log2FC, p-value, FDR; volcano plot

### Step 8 — WGCNA Co-expression Network & Module Extraction
- Package: `WGCNA` (R)
- Soft threshold: choose β where R² ≥ 0.85 (scale-free fit)
- Minimum module size: 30 genes; merge threshold: 0.25
- Trait correlation: Pearson/Spearman between module eigengenes and disease trait
- Output: module–trait heatmap; gene lists per module

### Step 9 — Candidate Target Prioritization
- Intersect: PPI hub genes ∩ DEGs ∩ WGCNA trait-correlated module genes
- Rank by: number of evidence layers hit (1 layer → 3 layers)
- Output: prioritized candidate target list with evidence annotations

### Step 10 — GO / KEGG Enrichment
- Package: `clusterProfiler` (R); background: all expressed genes in dataset
- GO thresholds: p.adjust < 0.05, q-value < 0.2
- KEGG thresholds: p.adjust < 0.05
- Visualization: dot plot (top 20 terms per category)
- Output: enrichment tables + dot plots

### Step 11 — ML-Based Hub Gene Ranking
- Methods (run ≥ 2 and take consensus): LASSO (`glmnet`), Random Forest (`randomForest` importance), SVM-RFE (`e1071`)
- Validation: 5-fold cross-validation; AUC on held-out set
- Output: ML feature importance ranking + AUC curves

### Step 12 — Immune Infiltration Analysis
- Primary: CIBERSORT via TIMER2.0 web (LM22 signature)
- Alternative: ssGSEA via `GSVA` package
- Analysis: Spearman correlation between hub gene expression and immune cell fraction
- ⚠ **Single-cell extension**: if scRNA-seq available via TISCH2, validate deconvolution at cell-type resolution using AUCell module scores
- Output: immune fraction heatmap + correlation matrix

### Step 13 — Molecular Docking
- Target preparation: retrieve PDB or AlphaFold structure; remove water/ligands; add hydrogens
- Ligand preparation: PubChem SDF → Open Babel → .pdbqt
- Docking: AutoDock Vina; define grid box on known binding domain or blind-dock
- Acceptance threshold: ΔG < −5 kcal/mol; inspect top pose in PyMOL
- Output: docking score table (compound, target, ΔG, key interacting residues)

### Step 14 — External Validation + Experimental Follow-up Design
- Cross-dataset: validate hub gene expression direction in ≥ 1 independent GEO cohort
- Experimental suggestions: cell viability (MTT/CCK-8), Western blot for hub protein, siRNA knockdown + rescue
- Output: cross-cohort validation table + experimental design proposal
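
The cross-dataset check in Step 14 (same expression direction in an independent cohort) can be illustrated with a small sketch. This is not part of the plan's toolchain: it assumes each cohort has already been reduced to a gene-to-log2FC mapping (for example, from a Step 7 DEG table), and the function names `direction` and `validate_direction`, the gene labels, and the values are all hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def direction(log2fc: float) -> str:
    """Classify a log2 fold change as 'up', 'down', or 'flat' (exactly zero)."""
    if log2fc > 0:
        return "up"
    if log2fc < 0:
        return "down"
    return "flat"


def validate_direction(
    discovery: Dict[str, float],
    validation: Dict[str, float],
) -> List[Tuple[str, str, str, str]]:
    """Build a cross-cohort table: (gene, discovery dir, validation dir, status).

    Status is 'consistent' when both cohorts agree on a non-flat direction,
    'discordant' otherwise, and 'not_measured' when the gene is absent
    from the validation cohort.
    """
    rows = []
    for gene in sorted(discovery):
        d_dir = direction(discovery[gene])
        if gene not in validation:
            rows.append((gene, d_dir, "NA", "not_measured"))
            continue
        v_dir = direction(validation[gene])
        agree = d_dir == v_dir and d_dir != "flat"
        rows.append((gene, d_dir, v_dir, "consistent" if agree else "discordant"))
    return rows


# Hypothetical log2FC values (disease vs. control), for illustration only.
discovery_lfc = {"GENE_A": 1.8, "GENE_B": -1.2, "GENE_C": 1.1}
validation_lfc = {"GENE_A": 0.9, "GENE_B": 0.4}

table = validate_direction(discovery_lfc, validation_lfc)
for row in table:
    print("\t".join(row))
```

In this sketch only the sign of the fold change is compared. A real validation table would likely also carry the adjusted p-value from the independent cohort, so that a "consistent" direction with no statistical support can be flagged separately.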