GraphRAG Builder Skill
Turns any documentation website into a runnable MCP knowledge server in 5 pipeline steps,
each run on the user's local machine using scripts in the scripts/ folder.
Quick Reference
| Step | Script | What it does |
|---|---|---|
| M1 | crawl.py | BFS crawl → raw HTML + metadata per page |
| M2 | extract_concepts.py | HTML → chunks → LLM concept extraction |
| M3 | build_graph.py | Concepts + links → networkx knowledge graph |
| M4 | build_embeddings.py | Chunks + concepts → numpy vector index |
| M5 | generate_mcp_server.py | Graph + embeddings → standalone server.py |
All scripts require Python 3.10+ and auto-install their own dependencies on first run.
Step 0: Clarify Requirements
Before running anything, ask the user:
- URL: Which site to crawl (required — starting page)
- Depth: How many link-hops to follow (default 3; suggest 2 for large sites)
- Model: Which Claude model for concept extraction: haiku (fast/cheap) or sonnet (higher quality). Default: haiku
Derive the output slug from the URL; the pipeline writes everything to output/<slug>-mcp/ (e.g., https://strudel.cc → slug strudel-cc, output directory strudel-cc-mcp/).
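A minimal sketch of the slug derivation (the scripts' exact normalization may differ):

```python
from urllib.parse import urlparse

def slug_from_url(url: str) -> str:
    """Turn a docs URL into an output slug, e.g. https://strudel.cc -> strudel-cc."""
    host = urlparse(url).netloc.lower()
    return host.replace(".", "-")

print(slug_from_url("https://strudel.cc") + "-mcp")  # strudel-cc-mcp
```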
Step 1: Crawl (M1)
Provide this command for the user to run locally:
python scripts/crawl.py \
--url <URL> \
--max-depth <DEPTH> \
--output ./output
What to expect:
- Creates output/<slug>-mcp/raw_content/*.json (one per page)
- Creates output/<slug>-mcp/crawl.json (state tracking)
- Prints a summary: pages crawled, JS fallbacks used, failures
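To sanity-check the crawl before moving on, you can tally how pages were fetched. This sketch assumes each raw_content JSON records a fetched_with field, as referenced under Troubleshooting:

```python
import json
from collections import Counter
from pathlib import Path

raw_dir = Path("output/strudel-cc-mcp/raw_content")  # adjust the slug
methods = Counter(
    json.loads(p.read_text()).get("fetched_with", "unknown")
    for p in raw_dir.glob("*.json")
)
print(f"{sum(methods.values())} pages crawled, fetch methods: {dict(methods)}")
```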
Common issues:
- JS-heavy single-page apps → many Playwright fallbacks (normal, just slower)
- Rate limiting → add --rate-limit 1.5 to slow down
- First run needs: pip install playwright && playwright install chromium
Step 2: Extract Concepts (M2)
The user must set ANTHROPIC_API_KEY first. Provide this command:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp \
--model haiku
Dry-run first (no API cost):
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
This validates chunking quality before spending API budget. Show the user the chunk counts and section names from the dry-run output.
What to expect:
- Processes ~2–5 pages/minute on haiku
- Creates output/<slug>-mcp/extracted/*.json (one per page)
- Each file contains chunks with: concepts, tags, code examples, prerequisites, relationships
- Skips already-extracted pages (safe to re-run after interruption)
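To gauge extraction quality beyond the script's own summary, a quick tally over the extracted files can help. The chunks and concepts key names below are assumptions based on the fields listed above:

```python
import json
from pathlib import Path

extracted = Path("output/strudel-cc-mcp/extracted")  # adjust the slug
pages = chunks = concepts = 0
for p in extracted.glob("*.json"):
    data = json.loads(p.read_text())
    pages += 1
    for chunk in data.get("chunks", []):
        chunks += 1
        concepts += len(chunk.get("concepts", []))

print(f"{pages} pages, {chunks} chunks, {concepts} concept mentions")
```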
Common issues:
- Pages showing no_chunks → likely JS-rendered content not captured; acceptable for a minority of pages
- API rate limiting → script retries automatically with exponential backoff
- Use the --max-pages 10 flag to test on a small sample first
Re-running after a partial run:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --model haiku
# (automatically skips already-extracted pages)
Force re-extraction of everything:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --force
Step 3: Build Graph (M3)
python scripts/build_graph.py --input ./output/<slug>-mcp
What to expect:
- Reads all non-dry-run extracted/*.json files
- Deduplicates concept names (case-insensitive, strips trailing ())
- Creates output/<slug>-mcp/graph.json
- Prints node/edge counts by type
Healthy output looks like:
Pages: 46
Chunks: 357
Concepts: 200+
Total edges: 1000+
MENTIONS 600+
REQUIRES 100+
HAS_CHUNK 357
LINKS_TO 80+
RELATED 40+
If concepts = 0 and "Skipped N dry-run files" appears, M2 hasn't been run with a real API key yet.
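To inspect the graph directly, you can reload it with networkx. This sketch assumes graph.json uses the standard networkx node-link JSON format and that nodes and edges carry a type attribute; adjust if the script's layout differs:

```python
import json
from collections import Counter
import networkx as nx

with open("output/strudel-cc-mcp/graph.json") as f:  # adjust the slug
    G = nx.node_link_graph(json.load(f))

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
print("node types:", Counter(d.get("type") for _, d in G.nodes(data=True)))
print("edge types:", Counter(d.get("type") for _, _, d in G.edges(data=True)))
```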
Step 4: Build Embeddings (M4)
python scripts/build_embeddings.py --input ./output/<slug>-mcp
First run downloads all-MiniLM-L6-v2 (~80MB, cached after that).
Add --smoke-test to query both collections immediately after building:
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
What to expect:
- Creates output/<slug>-mcp/embeddings/ with 5 numpy files (no database needed)
- Two indexes: chunks (semantic search) and concepts (concept lookup)
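For a feel of what the chunk index supports, here is a rough cosine-similarity search over the saved vectors. The chunks.npy name comes from the layout section below; the chunks.json metadata sidecar and its structure are assumptions:

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

base = "output/strudel-cc-mcp/embeddings"       # adjust the slug
vectors = np.load(f"{base}/chunks.npy")         # chunk vectors
meta = json.load(open(f"{base}/chunks.json"))   # assumed sidecar, one entry per vector

model = SentenceTransformer("all-MiniLM-L6-v2")
query = model.encode("how do I layer two patterns", normalize_embeddings=True)

# Cosine similarity reduces to a dot product once both sides are L2-normalized.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
for i in np.argsort(unit @ query)[::-1][:5]:
    print(round(float(unit[i] @ query), 3), meta[i])
```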
Step 5: Generate MCP Server (M5)
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
Outputs:
- output/<slug>-mcp/server.py — the runnable MCP server
- output/<slug>-mcp/mcp_config.json — Claude Desktop config snippet
Install into Claude Desktop:
- Open ~/Library/Application Support/Claude/claude_desktop_config.json
- Merge the contents of mcp_config.json into the "mcpServers" key (a merge sketch follows this list)
- Restart Claude Desktop
- The server name (e.g., strudel-cc) appears in Claude's available tools
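If you prefer to script the merge, a minimal sketch is below. The exact shape of mcp_config.json (whether it wraps its entry in an "mcpServers" key) is an assumption, so check the generated file first; the macOS path matches the step above:

```python
import json
from pathlib import Path

desktop_cfg = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
snippet = json.loads(Path("output/strudel-cc-mcp/mcp_config.json").read_text())  # adjust the slug

# Merge the generated server entry into the existing config without clobbering other servers.
config = json.loads(desktop_cfg.read_text()) if desktop_cfg.exists() else {}
config.setdefault("mcpServers", {}).update(snippet.get("mcpServers", snippet))
desktop_cfg.write_text(json.dumps(config, indent=2))
print("Installed servers:", ", ".join(config["mcpServers"]))
```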
Test the server standalone:
python output/<slug>-mcp/server.py
# Should print "Loading ... knowledge graph... Ready: N nodes, M edges"
The 8 MCP Tools
Once installed, Claude can use these tools against the knowledge base:
| Tool | Description |
|---|---|
| search(query, n=5) | Semantic search over all content chunks |
| get_concept(name) | Concept details + chunks where it appears |
| get_related(concept, n=5) | Related concepts via graph edges |
| get_learning_path(start, goal) | Shortest concept path between topics |
| get_prerequisites(concept) | What must be understood first |
| get_examples(concept) | Code examples for a concept |
| list_concepts(tag?, limit=20) | Browse all indexed concepts |
| get_page(url) | All chunks for a specific doc page |
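As a mental model (not the generated server's actual code), the graph-walking tools reduce to simple networkx queries over graph.json; the "type" edge attribute is assumed:

```python
import networkx as nx

def learning_path(G: nx.Graph, start: str, goal: str) -> list[str]:
    """Shortest concept-to-concept hop sequence, ignoring edge direction."""
    return nx.shortest_path(G.to_undirected(), start, goal)

def prerequisites(G: nx.DiGraph, concept: str) -> list[str]:
    """Concepts reached over REQUIRES edges."""
    return [v for _, v, d in G.out_edges(concept, data=True) if d.get("type") == "REQUIRES"]
```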
Complete Pipeline Command Sequence
For a fresh install, provide the user with all commands in order:
# 0. Install system deps (once)
pip install requests beautifulsoup4 lxml playwright anthropic \
networkx numpy sentence-transformers mcp
playwright install chromium
# 1. Crawl
python scripts/crawl.py --url <URL> --max-depth 3 --output ./output
# 2. Extract concepts (dry-run first)
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
# Then real run:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp --model haiku
# 3. Build graph
python scripts/build_graph.py --input ./output/<slug>-mcp
# 4. Build embeddings
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
# 5. Generate server
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
# 6. Test server
python output/<slug>-mcp/server.py
Output Directory Layout
output/<slug>-mcp/
├── crawl.json State tracking (incremental re-runs)
├── raw_content/ One JSON per crawled page (HTML + links)
├── extracted/ One JSON per page (chunks + LLM concepts)
├── graph.json networkx knowledge graph
├── embeddings/ numpy indexes (chunks.npy, concepts.npy + JSON)
├── server.py The runnable MCP server ← share this
└── mcp_config.json Claude Desktop config snippet ← install this
The entire output/<slug>-mcp/ folder is the deliverable. The user can move it anywhere
as long as server.py, graph.json, and embeddings/ stay together.
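That constraint follows from how a generated server typically resolves its data files relative to its own location; the pattern below is an assumption about the generated code, not a quote from it:

```python
from pathlib import Path

BASE = Path(__file__).resolve().parent   # the directory server.py lives in
GRAPH_PATH = BASE / "graph.json"         # must stay next to server.py
EMBEDDINGS_DIR = BASE / "embeddings"     # likewise
```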
Troubleshooting
"No module named X" → The script auto-installs deps, but if it fails:
pip install <package> --break-system-packages
Crawl gets 0 pages → Check robots.txt and try --force to bypass the crawl cache.
extract_concepts produces a tiny concept count → The page content may be JS-only.
Check the fetched_with field in raw_content/*.json — pages fetched via requests that contain
very little text should have been handled by Playwright instead. Re-crawl with --force.
Server fails to start → Run python output/<slug>-mcp/server.py directly and check
stderr for import errors. Most common cause: mcp package not installed.
Claude Desktop doesn't show the server → Verify the path in mcp_config.json is
absolute and the file exists. Restart Claude Desktop after any config change.
Deferred Features
See TODO.md for planned improvements including:
- YouTube transcript fetching
- Neo4j export for large graphs
- OpenAI/Voyage embedding API support
- Scheduled re-crawls
- Graph visualization