Deep Research Pipeline
Deep Research Pipeline turns broad questions into cited, publication-quality reports through a staged research workflow: planning, multi-query retrieval, chunk selection, analysis, reflection, writing, and optional verification.
It is designed for research that should not be answered from memory or a single search result. The pipeline keeps claims tied to sources, surfaces contradictions, tracks gaps, and can resume from checkpoints.
Why Use It
- Multi-stage research, not one-shot summarization — separate researcher, analyst, reflection, and writer stages.
- Citation integrity — findings and final claims trace back to URLs/sources.
- Reflection loops — the pipeline checks coverage and decides whether another cycle is needed.
- Portable LLM config — supports LLM_API_KEY/LLM_API_BASE, OpenAI-compatible endpoints, or Z.AI GLM.
- Operational controls — checkpoint/resume, time limits, token budgets, output formats, and mock mode.
When to Use
Deep research, comprehensive analysis, literature reviews, competitive analysis, fact-checking, technology deep-dives — anything needing multiple sources, synthesis, and verified citations.
Quick Start
cd skills/deep-research
# Optional: configure any OpenAI-compatible provider
export LLM_API_KEY="your-key"
export LLM_API_BASE="https://api.example.com/v1"
export LLM_MODEL="your-model"
# Or use OpenAI-compatible env names
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://api.example.com/v1"
# Run a report
python3 scripts/research_pipeline.py \
"Compare Vercel, Netlify, and Cloudflare Pages in 2026" \
--max-cycles 2 \
--format report \
--output report.md
# Test without API calls
python3 scripts/research_pipeline.py "test question" --mock --output report.md
If neither the universal (LLM_*) nor the OpenAI-compatible variables are set, the skill falls back to Z.AI via ZAI_API_KEY and ZAI_API_ENDPOINT.
Architecture
ORCHESTRATOR (you)
│
├── Plan → Decompose question into research dimensions
│
├── REFLECTION LOOP (up to 8 cycles)
│ ├── Researcher Agent (parallel) → multi-query search + chunk selection
│ ├── Analyst Agent → dedupe + themes + contradictions
│ └── Reflection → coverage check, gap analysis, continue decision
│
├── Writer Agent → polished report (report/summary/brief/json)
│
└── Verify (optional) → adversarial fact-check
Key principle: Orchestrator NEVER searches directly. Clean output flows between stages only.
Two Modes
Mode 1: Full Pipeline CLI (Recommended)
Use the enhanced research_pipeline.py for automated end-to-end research:
# Full research with all features
python3 scripts/research_pipeline.py "What is the state of quantum computing in 2026?" \
--max-cycles 3 \
--output report.md \
--format report
# Mock mode (no API calls, for testing)
python3 scripts/research_pipeline.py "test question" --mock --output report.md
# With budget limits
python3 scripts/research_pipeline.py "question" \
--max-cycles 3 --time-limit 300 --token-limit 40000
# Resume from checkpoint
python3 scripts/research_pipeline.py "question" \
--resume checkpoint.json --output report.md
# Explicit dimensions
python3 scripts/research_pipeline.py "question" \
--dimensions architecture benchmarks limitations \
--output report.md --format summary
CLI Flags:
| Flag | Default | Description |
|---|---|---|
| --max-cycles | 3 | Max research cycles (1-8) |
| --mock | false | Use mock data, no API calls |
| --output / -o | stdout | Output file path |
| --format / -f | report | Output format: report, summary, brief, json |
| --time-limit | 900 | Max seconds for entire pipeline |
| --token-limit | 60000 | Max estimated tokens |
| --checkpoint | none | Save checkpoints to path |
| --resume | none | Resume from checkpoint file |
| --dimensions | auto | Explicit research dimensions |
| --no-parallel | false | Research dimensions sequentially |
Output formats:
- report — Full markdown: Executive Summary → Key Findings → Detailed Analysis → Contradictions → Gaps → Sources → Methodology
- summary — Executive summary + top 5 findings + sources
- brief — Bullet-point format for quick scanning
- json — Structured JSON with annotated findings and metadata
Mode 2: Orchestrated Sub-Agents (For complex research)
Use this mode when you need fine-grained control over each stage or want to research dimensions in parallel with sub-agents.
Workflow (Orchestrated Mode)
Phase 1: Planning
- Analyze question, create slug, make memory/research/<slug>/ directory
- Generate research plan with dimensions and questions
- Save to plan.md
Phase 2: Research Cycle (repeat up to 8 times)
Step A: Spawn Researcher Agent(s)
Use sessions_spawn with a task brief (NOT the full query):
{
  "dimension": "technical architecture",
  "specific_questions": ["How does X work?", "What are Y's components?"],
  "context_limit": 5000,
  "max_sources": 10
}
Researcher agent does:
- Multi-query generation — scripts/query_generator.py produces 3-5 variants
- Parallel search — web_search for each variant
- Content fetching — web_fetch for top results
- LLM chunk selection — scripts/chunk_selector.py scores each chunk (≥0.7)
- Context expansion — scripts/context_expander.py fetches surrounding content
- Output: JSON findings with citations
Can spawn 2-3 researcher agents in parallel for different dimensions.
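A single dimension can also be driven directly from Python via researcher.py (see the Scripts table). A minimal sketch, assuming keyword arguments that mirror the task brief fields above; verify the actual signature of research_dimension in researcher.py before relying on it:

import sys, os
sys.path.insert(0, os.path.expanduser("~/.openclaw/workspace/skills/deep-research/scripts"))
from researcher import research_dimension

# Assumed call shape: argument names mirror the task brief above, not a verified signature.
findings = research_dimension(
    dimension="technical architecture",
    questions=["How does X work?", "What are Y's components?"],
    max_sources=10,
)
# Per Step A, the result should be JSON-serializable findings with citations.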
Step B: Spawn Analyst Agent
After researcher(s) complete, spawn the analyst with their combined output (see the sketch after this list):
- Deduplicate overlapping findings
- Flag contradictions (explicit + implicit)
- Group into thematic clusters
- Identify gaps
- Output: Cleaned JSON + gap list
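A minimal sketch of running the analyst step locally (analyst.py needs no API). The researcher-output-*.json filenames follow the File Layout section, and the assumption that analyze_findings takes the merged findings list and returns a JSON-serializable dict should be checked against analyst.py:

import glob, json, os, sys
sys.path.insert(0, os.path.expanduser("~/.openclaw/workspace/skills/deep-research/scripts"))
from analyst import analyze_findings

research_dir = "memory/research/my-topic"  # illustrative slug

# Merge the raw researcher findings saved during Step A (assumes each file holds a list of findings).
combined = []
for path in glob.glob(f"{research_dir}/researcher-output-*.json"):
    with open(path) as f:
        combined.extend(json.load(f))

# Assumed interface: merged findings in, deduped findings + themes/contradictions/gaps out.
analyst_output = analyze_findings(combined)

with open(f"{research_dir}/analyst-output.json", "w") as f:
    json.dump(analyst_output, f, indent=2)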
Step C: Run Reflection
After analyst completes, run scripts/reflection.py:
- What's covered? (themes + confidence scores)
- What gaps remain? (unanswered questions)
- What contradictions emerged?
- New directions discovered?
- Should continue? (coverage ≥ 0.8 with only minor gaps remaining → stop)
Save reflection to memory/research/<slug>/reflection-cycle-N.md
Continue Decision
- Coverage ≥ 0.8 AND gaps minor → proceed to Phase 3
- Major contradictions → spawn targeted researcher
- Significant gaps → another researcher cycle
- Hard stop at cycle 8
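The decision rule above reduces to a small predicate. A sketch; the reflection.py flags come from the Scripts table, and how the script reports its result (stdout versus a file) is an assumption to confirm in the script itself:

import subprocess

MAX_CYCLES = 8  # hard stop from the workflow above

def should_continue(coverage: float, has_major_gaps: bool, cycle: int) -> bool:
    # Stop once coverage >= 0.8 with only minor gaps remaining, or at the cycle cap.
    if cycle >= MAX_CYCLES:
        return False
    return coverage < 0.8 or has_major_gaps

# Run reflection for cycle 1 and save it alongside the plan (stdout capture is assumed).
result = subprocess.run(
    ["python3", "scripts/reflection.py", "-q", "What is RAG?", "-f", "findings.json", "-c", "1"],
    capture_output=True, text=True, check=True,
)
with open("memory/research/my-topic/reflection-cycle-1.md", "w") as f:
    f.write(result.stdout)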
Phase 3: Write Report
Use the Writer Agent (scripts/writer.py) for publication-quality output:
# From Python
from writer import WriterAgent, OutputFormat, write_report, save_report

question = "What is RAG?"

# Generate report using WriterAgent
agent = WriterAgent(use_llm=True)
result = agent.write_report(
    analyst_output,  # from analyst or run_analyst()
    question=question,
    fmt=OutputFormat.REPORT,
)

# Or use the convenience function
result = write_report(analyst_output, question, fmt="report")

# Save to file
save_report(result, "output/report.md")
Report features:
- 🟢🟡🟠🔴 Confidence indicators on every finding
- [source_url] inline citations throughout
- ⚠️ Contradiction callout boxes where sources disagree
- Structured sections: Summary → Findings → Analysis → Contradictions → Gaps → Sources → Methodology
- Template-based fallback when no LLM available
Phase 4: Verify (optional sub-agent)
Spawn adversarial verifier:
- Anchor every claim to source
- Verify URLs with web_fetch
- Remove unsourced claims
- Save to review.md
Phase 5: Deliver
- Fix any FATAL issues from review
- Copy to final.md
- Write provenance.md (date, cycles, sources, verification status)
- Send summary to user
Python API
import sys, os
sys.path.insert(0, os.path.expanduser("~/.openclaw/workspace/skills/deep-research/scripts"))
from research_pipeline import run_enhanced_pipeline
result = run_enhanced_pipeline(
    question="What is the state of quantum computing in 2026?",
    max_cycles=3,
    dimensions=["hardware", "algorithms", "applications", "challenges"],
    mock_mode=False,
    output_format="report",
    time_limit=900,
    token_limit=60000,
    checkpoint_path="checkpoint.json",  # auto-saves progress
    parallel_dimensions=True,  # parallel research per dimension
)
# result["report"] = markdown string
# result["cycles_completed"] = int
# result["final_coverage"] = float (0.0-1.0)
# result["metadata"] = dict with timing, findings count, etc.
Scripts
| Script | Purpose | Usage |
|---|---|---|
| research_pipeline.py | Full pipeline orchestration | python3 scripts/research_pipeline.py "question" --max-cycles 3 |
| query_generator.py | Generate 3-5 search query variants | python3 scripts/query_generator.py -q "..." |
| chunk_selector.py | LLM scores chunks, filters by threshold | python3 scripts/chunk_selector.py -q "..." -c chunks.json |
| context_expander.py | Fetch surrounding context for incomplete chunks | python3 scripts/context_expander.py -s selected.json -q "..." |
| reflection.py | Mandatory gap/contradiction check | python3 scripts/reflection.py -q "..." -f findings.json -c 1 |
| writer.py | Publication-quality report generation | from writer import WriterAgent, write_report |
| analyst.py | Dedup + themes + contradictions (no API needed) | from analyst import analyze_findings |
| researcher.py | Multi-source research orchestration | from researcher import research, research_dimension |
| research_sources.py | Search adapters (web, GitHub, docs) | from research_sources import WebSearchSource |
| fact-checker.py | Claim extraction + source ranking | python3 scripts/fact-checker.py "text" --sources '["url1"]' |
All LLM-enabled scripts use the shared provider-agnostic llm_client.py.
Provider resolution order:
1. LLM_API_KEY + LLM_API_BASE + optional LLM_MODEL
2. OPENAI_API_KEY + OPENAI_API_BASE / OPENAI_BASE_URL + optional OPENAI_MODEL
3. ZAI_API_KEY + optional ZAI_API_ENDPOINT / GLM_MODEL
If no key is configured, use --mock for local pipeline testing or rely on scripts with rule-based fallbacks where available.
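As an illustration of that precedence (not the actual llm_client.py implementation), resolution might look like:

import os

def resolve_provider():
    # Illustrative only; the real logic (defaults, validation) lives in llm_client.py.
    if os.getenv("LLM_API_KEY") and os.getenv("LLM_API_BASE"):
        return os.environ["LLM_API_KEY"], os.environ["LLM_API_BASE"], os.getenv("LLM_MODEL")
    if os.getenv("OPENAI_API_KEY"):
        base = os.getenv("OPENAI_API_BASE") or os.getenv("OPENAI_BASE_URL")
        return os.environ["OPENAI_API_KEY"], base, os.getenv("OPENAI_MODEL")
    if os.getenv("ZAI_API_KEY"):
        return os.environ["ZAI_API_KEY"], os.getenv("ZAI_API_ENDPOINT"), os.getenv("GLM_MODEL")
    return None  # no key configured: use --mock or rule-based fallbacks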
Examples
Example 1: Quick Competitive Analysis
python3 scripts/research_pipeline.py \
"Compare Vercel vs Netlify vs Cloudflare Pages features and pricing 2026" \
--max-cycles 2 \
--dimensions features pricing performance ecosystem \
--format summary \
--output competitive-analysis.md
Example 2: Deep Technology Research
python3 scripts/research_pipeline.py \
"What is the current state of AI agent frameworks?" \
--max-cycles 4 \
--time-limit 600 \
--token-limit 80000 \
--checkpoint /tmp/ai-agents-checkpoint.json \
--format report \
--output ai-agents-research.md
Example 3: Literature Review (mock mode for testing)
python3 scripts/research_pipeline.py \
"What does the research say about transformer architecture efficiency?" \
--mock \
--max-cycles 3 \
--format report \
--output literature-review.md
Example 4: Bullet Brief for Quick Scanning
python3 scripts/research_pipeline.py \
"What are the latest developments in Rust web frameworks?" \
--max-cycles 2 \
--format brief \
--output rust-web-brief.md
Example 5: JSON Output for Programmatic Use
python3 scripts/research_pipeline.py \
"What is the market size of edge computing?" \
--max-cycles 2 \
--format json \
--output edge-computing-data.json
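The JSON file can then be consumed programmatically. The exact schema is defined by the writer; the top-level keys used below ("findings", "metadata") are assumptions based on the format description, so inspect the file first:

import json

with open("edge-computing-data.json") as f:
    data = json.load(f)

# Assumed top-level keys; adjust after inspecting the actual output.
for finding in data.get("findings", []):
    print(finding)
print(data.get("metadata", {}))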
Integration with Night Shift
To queue research plans for Night Shift execution:
- Create a research plan file:
// memory/research/queued/<slug>.json
{
  "question": "What is the state of quantum computing in 2026?",
  "max_cycles": 3,
  "dimensions": ["hardware", "algorithms", "applications"],
  "output_format": "report",
  "output_path": "memory/research/quantum-2026/final.md",
  "time_limit": 600,
  "created_at": "2026-04-25T06:00:00Z"
}
- Night Shift picks up queued plans and runs them via:
python3 scripts/research_pipeline.py "$QUESTION" \
--max-cycles $MAX_CYCLES \
--dimensions $DIMENSIONS \
--format $FORMAT \
--output $OUTPUT_PATH \
--time-limit $TIME_LIMIT
- Results are saved to memory/research/<slug>/final.md with provenance metadata.
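A hypothetical runner that maps the plan fields onto the CLI flags above could look like this (error handling and provenance writing omitted):

import glob, json, subprocess

for plan_path in glob.glob("memory/research/queued/*.json"):
    with open(plan_path) as f:
        plan = json.load(f)
    subprocess.run(
        [
            "python3", "scripts/research_pipeline.py", plan["question"],
            "--max-cycles", str(plan["max_cycles"]),
            "--dimensions", *plan["dimensions"],
            "--format", plan["output_format"],
            "--output", plan["output_path"],
            "--time-limit", str(plan["time_limit"]),
        ],
        check=True,
    )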
File Layout
memory/research/<slug>/
├── plan.md # Research plan with dimensions
├── reflection-cycle-1.md # Reflection after each cycle
├── reflection-cycle-2.md
├── researcher-output-*.json # Raw researcher findings
├── analyst-output.json # Merged/deduped findings
├── draft.md # First draft
├── brief.md # Verified brief
├── review.md # Adversarial review (optional)
├── final.md # Final report
├── provenance.md # Metadata + source verification status
└── checkpoint.json # Pipeline checkpoint (auto-saved)
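Because analyst-output.json is kept on disk, the writer can re-render the same findings in another format without rerunning research. A sketch, assuming the saved JSON matches what write_report expects as input:

import json, os, sys
sys.path.insert(0, os.path.expanduser("~/.openclaw/workspace/skills/deep-research/scripts"))
from writer import write_report, save_report

with open("memory/research/quantum-2026/analyst-output.json") as f:
    analyst_output = json.load(f)

result = write_report(analyst_output, "What is the state of quantum computing in 2026?", fmt="brief")
save_report(result, "memory/research/quantum-2026/brief.md")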
Quick Mode
Skip sub-agents and the full pipeline: run 5-10 searches yourself, still use evidence tables, verify URLs, and cite sources. Output is shorter and delivered inline in chat.
Integrity Commandments
- Never fabricate a source — no URL = don't mention it
- Never claim existence without checking
- Never extrapolate unread details
- Read before summarizing
- No fake certainty — never say "verified" unless checked
- Never invent numbers/benchmarks/comparisons
- Separate observations from inferences
- Every claim traces to a source — citation integrity is mandatory
- Reflection is not optional — run it after every cycle
- Stage separation — orchestrator never searches, researchers never see full plan
Scale Decision
- Single fact → Quick Mode (3-10 tool calls, no sub-agents)
- 2-3 item comparison → 2 parallel researcher sub-agents, 2-3 cycles
- Broad/multi-faceted → 3-4 researcher sub-agents, 3-5 cycles
- PhD-level deep dive → 4+ researchers, 5-8 cycles
See Also
- DOCS.md — Full API reference, architecture diagrams, troubleshooting
- test_integration.py — 29 integration tests covering the full pipeline
- test_pipeline.py — 26 unit tests for individual components