# GEO Dataset Search Strategy and Bioinformatics Tool Reference

> Last verified: March 2026
>
> Use during Step 4 (Dataset & Preprocessing module) to identify suitable GEO datasets and verify tool availability.

---

## Part 1: GEO Dataset Search Strategy by Disease Class

### General Search Approach

1. Go to https://www.ncbi.nlm.nih.gov/geo/
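2. Search: `"[disease name]" AND "Homo sapiens"[Organism] AND "[tissue type]"` (the `[Organism]` field tag restricts the species match to the organism field rather than free text)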
3. Filter: DataSet Type = Expression profiling by array OR Expression profiling by high throughput sequencing
4. Sort by: Relevance or Sample Count (prefer n ≥ 20 per group)
5. Check GPL platform: prefer GPL570 (Affymetrix HG-U133 Plus 2.0) or GPL96 (HG-U133A) for microarray; prefer Illumina HiSeq for RNA-seq
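
The search steps above can be sketched as a small query builder. This is a minimal sketch, not an official client: the helper names `build_geo_query` and `build_esearch_url` are hypothetical, and the `[DataSet Type]` field tag is used as I understand it from the GEO DataSets query documentation, so confirm it against the current field list before relying on it. Nothing is sent over the network; the code only composes the strings.

```python
from urllib.parse import urlencode

# Study types from step 3 of the search approach.
STUDY_TYPES = (
    "expression profiling by array",
    "expression profiling by high throughput sequencing",
)


def build_geo_query(disease, tissue, organism="Homo sapiens", study_types=STUDY_TYPES):
    """Compose a GEO DataSets search term (hypothetical helper)."""
    # Assumption: [DataSet Type] is the field tag for the study-type filter.
    type_clause = " OR ".join(f'"{t}"[DataSet Type]' for t in study_types)
    return f'"{disease}" AND "{tissue}" AND "{organism}"[Organism] AND ({type_clause})'


def build_esearch_url(term, retmax=50):
    """Wrap the term in an NCBI E-utilities esearch URL (db=gds is GEO DataSets)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "gds", "term": term, "retmax": retmax})


query = build_geo_query("intracranial aneurysm", "arterial wall")
print(query)
print(build_esearch_url(query))
```

The printed term can be pasted into the GEO DataSets search box as-is. Sample count and GPL platform (steps 4 and 5) still need to be checked on each series record.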

### Minimum Dataset Requirements

| Configuration | Datasets per Disease | Minimum n per Group |
|---|---|---|
| Lite | 1 (discovery only) | ≥ 10 |
| Standard | 2 (discovery + validation) | ≥ 15 discovery, ≥ 10 validation |
| Advanced | 2–3 | ≥ 20 per group preferred |
| Publication+ | 3+ or cross-platform | ≥ 25 preferred; external replication encouraged |
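
As a rough illustration, the table above can be encoded as a lookup and used to screen candidate datasets. This sketch makes two simplifying assumptions that are not in the table: "preferred" thresholds are treated as hard minimums, and the "or cross-platform" alternative for Publication+ is ignored. The names `REQUIREMENTS` and `check_configuration` are hypothetical.

```python
# Thresholds transcribed from the table; None means the tier has no validation dataset.
REQUIREMENTS = {
    "Lite": {"min_datasets": 1, "discovery": 10, "validation": None},
    "Standard": {"min_datasets": 2, "discovery": 15, "validation": 10},
    "Advanced": {"min_datasets": 2, "discovery": 20, "validation": 20},
    "Publication+": {"min_datasets": 3, "discovery": 25, "validation": 25},
}


def check_configuration(config, datasets):
    """Return a list of problems; an empty list means the minimums are met.

    datasets: list of dicts with 'role' ('discovery' or 'validation')
    and 'group_sizes' (sample count in each comparison group).
    """
    req = REQUIREMENTS[config]
    problems = []
    if len(datasets) < req["min_datasets"]:
        problems.append(
            f"{config} needs at least {req['min_datasets']} dataset(s), got {len(datasets)}"
        )
    for index, ds in enumerate(datasets):
        minimum = req[ds["role"]]
        if minimum is None:
            continue  # no threshold defined for this role in this tier
        smallest = min(ds["group_sizes"])
        if smallest < minimum:
            problems.append(
                f"dataset {index} ({ds['role']}): smallest group n={smallest} < {minimum}"
            )
    return problems


candidates = [
    {"role": "discovery", "group_sizes": [18, 16]},
    {"role": "validation", "group_sizes": [12, 9]},
]
for problem in check_configuration("Standard", candidates):
    print(problem)
```

In this example the validation dataset is flagged because its smaller group has 9 samples, below the Standard minimum of 10.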

### Platform Compatibility Rules

| Scenario | Action |
|---|---|
| Both datasets on same GPL | Merge after normalization; check for batch effects between series and correct if present |
| GPL570 + GPL96 | Subset to common probes; note probe coverage loss |
| Microarray + RNA-seq | Do NOT merge; analyze separately and compare DEG lists |
| Different species | Do NOT merge; use ortholog mapping only for supplementary analysis |
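
The rules above amount to a short decision procedure, sketched below. The function name `merge_action`, the dictionary keys, and the returned strings are all hypothetical; the only platform pair treated as "subset to common probes" is the GPL570 + GPL96 pair named in the table, and anything else is returned for manual review rather than guessed.

```python
# Platform pairs the table explicitly allows to be combined via shared probes.
COMMON_PROBE_PAIRS = {frozenset({"GPL570", "GPL96"})}


def merge_action(a, b):
    """Decide how two datasets may be combined.

    a, b: dicts with 'species', 'technology' ('microarray' or 'rnaseq'),
    and 'platform' (GPL ID).
    """
    if a["species"] != b["species"]:
        return "do not merge: ortholog mapping for supplementary analysis only"
    if a["technology"] != b["technology"]:
        return "do not merge: analyze separately and compare DEG lists"
    if a["platform"] == b["platform"]:
        return "merge after normalization"
    if frozenset({a["platform"], b["platform"]}) in COMMON_PROBE_PAIRS:
        return "subset to common probes and note probe coverage loss"
    return "not covered by the rules: review manually"


discovery = {"species": "Homo sapiens", "technology": "microarray", "platform": "GPL570"}
validation = {"species": "Homo sapiens", "technology": "microarray", "platform": "GPL96"}
print(merge_action(discovery, validation))
```

Species and technology are checked before platform so that a "do not merge" rule always takes precedence over a platform match.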

---

## Part 2: Recommended GEO Datasets by Disease Class (Examples)

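> These are illustrative examples only. Before use, confirm each accession's disease definition, tissue type, control group, and GPL platform on its GEO series page; the Platform column is indicative, and individual series may use other arrays or RNA-seq. Accession numbers may be superseded by newer datasets.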

| Disease | Example GEO Accessions | Tissue | Platform |
|---|---|---|---|
| Intracranial aneurysm | GSE75436, GSE57691, GSE26979 | Arterial wall | GPL570, GPL96 |
| Abdominal aortic aneurysm | GSE57492, GSE47472, GSE98278 | Aortic tissue | GPL570, GPL6244 |
| Atherosclerosis / CAD | GSE20129, GSE34822, GSE43292 | Arterial plaque | GPL570 |
| Type 2 diabetes | GSE23343, GSE25724, GSE41762 | Adipose / islet | GPL570, GPL96 |
| NAFLD / NASH | GSE48452, GSE89632, GSE126848 | Liver biopsy | GPL570, GPL6244 |
| SLE | GSE72326, GSE49454, GSE81622 | PBMC / blood | GPL570, GPL6244 |
| Rheumatoid arthritis | GSE55235, GSE77298, GSE45291 | Synovial tissue | GPL570, GPL96 |
| Alzheimer's disease | GSE5281, GSE48350, GSE122063 | Brain (hippocampus/cortex) | GPL570, GPL96 |
| Parkinson's disease | GSE7621, GSE20163, GSE49036 | Substantia nigra / brain | GPL570, GPL96 |
| Colorectal cancer | GSE44076, GSE21510, GSE39582 | Colon tissue | GPL570, GPL6244 |
| C. difficile infection | GSE45301, GSE50190 | Colonic mucosa | GPL570 |
| IPF | GSE24206, GSE32537, GSE47460 | Lung tissue | GPL570 |
| CKD / diabetic nephropathy | GSE30528, GSE96804 | Kidney biopsy | GPL570 |
| Heart failure | GSE57338, GSE76701 | Cardiac tissue | GPL570 |

---

## Part 3: Bioinformatics Tool Reference

### Core Pipeline Tools

| Stage | Primary Tool | R Package / Access | Alternative |
|---|---|---|---|
| Data retrieval | GEOquery | `BiocManager::install("GEOquery")` | GEO2R (web interface) |
| Normalization (microarray) | affy (RMA) + limma | `BiocManager::install(c("limma", "affy"))` | oligo (for Gene ST arrays such as GPL6244) |
| DEG analysis | limma | `BiocManager::install("limma")` | DESeq2 (RNA-seq) |
| GO/KEGG enrichment | clusterProfiler | `BiocManager::install("clusterProfiler")` | DAVID (web), enrichR |
| Pathway visualization | ggplot2 + clusterProfiler | `install.packages("ggplot2")`; `BiocManager::install("clusterProfiler")` | pathview |
| PPI construction | STRINGdb | `BiocManager::install("STRINGdb")` | STRING web interface |
| Network visualization | Cytoscape | Desktop app (cytoscape.org) | igraph (R) |
| Hub module detection | MCODE | Cytoscape plugin | CluePedia |
| Hub gene ranking | CytoHubba | Cytoscape plugin (≥5 algorithms) | NetworkAnalyzer |
| ROC / AUC | pROC | `install.packages("pROC")` | ROCR |
| GSEA | clusterProfiler (GSEA function) | `BiocManager::install("clusterProfiler")` | fgsea |
| Co-expression | WGCNA | `BiocManager::install("WGCNA")` (CRAN package; this route also resolves its Bioconductor dependencies) | CEMiTool |
| Batch correction | ComBat (sva) | `BiocManager::install("sva")` | limma::removeBatchEffect |

### CytoHubba Algorithm Reference (≥5 Required)

| Algorithm | Measures | Strengths |
|---|---|---|
| Degree | Number of direct connections | Simple; captures highly connected nodes |
| MCC (Maximum Clique Centrality) | Clique-based centrality | Reported by the cytoHubba authors as the most precise method for recovering essential proteins; preferred primary |
| Betweenness Centrality | Fraction of shortest paths that pass through the node | Identifies bottleneck/bridge nodes |
| Closeness Centrality | Inverse of the average shortest-path distance to other nodes | Captures globally central nodes |
| EPC (Edge Percolated Component) | Percolation-based robustness | Robust to network noise |
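| Stress Centrality | Number of shortest paths that pass through the node | Complementary to Betweenness (absolute count rather than fraction) |
| Radiality | Closeness to other nodes relative to the network diameter | Captures how easily a node reaches the rest of the network |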

**Minimum required**: Degree + MCC + Betweenness + Closeness + EPC (5 algorithms). Report consensus top genes across all applied algorithms.
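
One way to derive the consensus is to count, for each gene, how many algorithms place it in their top N, then keep the genes that reach a vote threshold. The sketch below assumes the per-algorithm rankings have already been exported from CytoHubba as ordered gene lists; `consensus_hub_genes` is a hypothetical helper, the strict default (a gene must appear in every list) is one interpretation of "consensus", and the gene names are placeholders rather than real results.

```python
from collections import Counter


def consensus_hub_genes(rankings, top_k=10, min_votes=None):
    """Return genes found in the top_k of at least min_votes algorithms.

    rankings: dict mapping algorithm name to a gene list ordered best first.
    min_votes defaults to the number of algorithms (strict intersection).
    """
    if min_votes is None:
        min_votes = len(rankings)
    votes = Counter()
    for genes in rankings.values():
        votes.update(set(genes[:top_k]))  # one vote per algorithm per gene
    selected = [gene for gene, count in votes.items() if count >= min_votes]
    # Most-supported genes first; ties broken alphabetically for stable output.
    return sorted(selected, key=lambda gene: (-votes[gene], gene))


# Placeholder rankings for the five required algorithms.
rankings = {
    "Degree": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "MCC": ["GENE_B", "GENE_A", "GENE_D", "GENE_E"],
    "Betweenness": ["GENE_A", "GENE_C", "GENE_B", "GENE_F"],
    "Closeness": ["GENE_B", "GENE_A", "GENE_C", "GENE_D"],
    "EPC": ["GENE_A", "GENE_B", "GENE_E", "GENE_C"],
}
print(consensus_hub_genes(rankings, top_k=3))               # ['GENE_A', 'GENE_B']
print(consensus_hub_genes(rankings, top_k=3, min_votes=3))  # ['GENE_A', 'GENE_B', 'GENE_C']
```

Whichever threshold is used, report `top_k` and `min_votes` alongside the gene list so the consensus can be reproduced.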

---

## Part 4: GEO Data Quality Checklist

Before committing a dataset to the analysis, verify:

- [ ] Sample size: ≥ 10 per group in both discovery and validation datasets (baseline; the Standard and higher configurations in Part 1 require more)
- [ ] Disease definition: clinical diagnosis or pathological confirmation stated in Series description
- [ ] Control group: healthy controls or histologically normal adjacent tissue (not peri-lesional or otherwise affected tissue unless justified)
- [ ] Tissue type: matches intended tissue for this disease class
- [ ] Platform: GPL ID recorded; cross-dataset platform compatibility confirmed
- [ ] Availability: expression matrix downloadable (not raw FASTQ only)
- [ ] No obvious batch confound: check whether all cases come from one center (or batch) and all controls from another
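
The batch-confound item can be screened mechanically once each sample is annotated with its group and its center or processing batch. The sketch below assumes those annotations have already been extracted from the series metadata; `is_fully_confounded` is a hypothetical helper. It detects only complete confounding (no batch contains more than one group), so a strongly unbalanced design still needs a manual look.

```python
from collections import defaultdict


def is_fully_confounded(samples):
    """Return True when group and batch cannot be separated.

    samples: list of (group, batch) tuples, one per sample.
    Fully confounded means at least two groups exist and every batch
    contains samples from exactly one group.
    """
    groups_per_batch = defaultdict(set)
    for group, batch in samples:
        groups_per_batch[batch].add(group)
    all_groups = {group for group, _ in samples}
    if len(all_groups) < 2:
        return False  # nothing to compare, so nothing to confound
    return all(len(groups) == 1 for groups in groups_per_batch.values())


# Cases all from center_1 and controls all from center_2: confounded.
bad_design = [("case", "center_1")] * 3 + [("control", "center_2")] * 3
# Both centers contribute cases and controls: not fully confounded.
ok_design = [
    ("case", "center_1"), ("control", "center_1"),
    ("case", "center_2"), ("control", "center_2"),
]
print(is_fully_confounded(bad_design))  # True
print(is_fully_confounded(ok_design))   # False
```

A `True` result means batch correction cannot separate the disease signal from the center effect, so the dataset should be set aside or used only with that limitation stated.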