Install
openclaw skills install graph-rag-builder

Builds a fully runnable MCP (Model Context Protocol) knowledge server from any website or documentation URL. Crawls the site, extracts concepts using Claude, organizes them into a knowledge graph, generates vector embeddings, and produces a ready-to-install MCP server with 8 semantic search and graph traversal tools. Use this skill whenever the user wants to: build an MCP server from a website or docs, create a semantic search index over documentation, make a knowledge graph from a library or framework's docs, turn a tutorial site into a Claude-searchable knowledge base, index learning materials for AI retrieval, or build a "smart docs" assistant for any topic. Trigger even if the user just mentions "scraping docs", "indexing a site", "building a knowledge graph", or "making docs searchable in Claude".

Turns any documentation website into a runnable MCP knowledge server in 5 pipeline steps,
each run on the user's local machine using scripts in the scripts/ folder.
| Step | Script | What it does |
|---|---|---|
| M1 | crawl.py | BFS crawl → raw HTML + metadata per page |
| M2 | extract_concepts.py | HTML → chunks → LLM concept extraction |
| M3 | build_graph.py | Concepts + links → networkx knowledge graph |
| M4 | build_embeddings.py | Chunks + concepts → numpy vector index |
| M5 | generate_mcp_server.py | Graph + embeddings → standalone server.py |
All scripts require Python 3.10+ and auto-install their own dependencies on first run.
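How the scripts auto-install their dependencies is not shown here. A minimal sketch of one way such a check could work, using only the standard library; the helpers `missing_packages` and `install` are hypothetical names, not taken from the scripts:

```python
import importlib.util
import subprocess
import sys


def missing_packages(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported here."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def install(packages: list[str]) -> None:
    """Install packages with the pip that belongs to the running interpreter."""
    if packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *packages])
```

Note that import names and pip names can differ (for example, the `beautifulsoup4` package is imported as `bs4`), so a real script would need a mapping between the two.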
Before running anything, ask the user:
- Model: haiku (fast/cheap) or sonnet (higher quality). Default: haiku

Set the output slug from the URL: https://strudel.cc → strudel-cc-mcp.
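The URL-to-slug mapping in the example can be sketched as follows. This is an assumption based on the single strudel.cc example: the hypothetical helper `output_slug` uses only the hostname, and how the real scripts treat paths, ports, or a `www.` prefix is not specified here.

```python
from urllib.parse import urlparse


def output_slug(url: str) -> str:
    """Turn a site URL into a folder-friendly slug (assumed rule: dots become dashes)."""
    host = urlparse(url).hostname or ""
    return host.replace(".", "-") + "-mcp"
```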
For M1 (crawl.py), provide this command for the user to run locally:
python scripts/crawl.py \
--url <URL> \
--max-depth <DEPTH> \
--output ./output
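For intuition about what --max-depth controls, here is a minimal sketch of a depth-limited breadth-first crawl. It walks an in-memory link map instead of fetching pages, and `bfs_crawl` is a hypothetical name; the real crawl.py may differ in how it counts depth.

```python
from collections import deque


def bfs_crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Visit pages breadth-first, following links at most max_depth hops from start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Under this reading, a depth of 0 visits only the start page, and each extra level adds the pages linked from the previous level.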
What to expect from the crawl:
- output/<slug>-mcp/raw_content/*.json (one per page)
- output/<slug>-mcp/crawl.json (state tracking)

Common issues:
- Add --rate-limit 1.5 to slow the crawl down
- Install Playwright if it is missing: pip install playwright && playwright install chromium

For M2 (extract_concepts.py), the user must set ANTHROPIC_API_KEY first. Provide this command:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp \
--model haiku
Dry-run first (no API cost):
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
This validates chunking quality before spending API budget. Show the user the chunk counts and section names from the dry-run output.
What to expect from the real run:
- output/<slug>-mcp/extracted/*.json (one per page)

Common issues:
- no_chunks → likely JS-rendered content not captured; acceptable for a minority of pages
- Use the --max-pages 10 flag to test on a small sample first

Re-running after a partial run:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --model haiku
# (automatically skips already-extracted pages)
Force re-extraction of everything:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --force
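The skip-and-force behaviour described above can be pictured with a small sketch. It assumes that each extracted file shares its filename with the raw page it came from, which is not confirmed here; `pages_to_extract` is a hypothetical helper.

```python
from pathlib import Path


def pages_to_extract(folder: Path, force: bool = False) -> list[str]:
    """List raw page files that still need extraction (all of them when force is set)."""
    raw = sorted(p.name for p in (folder / "raw_content").glob("*.json"))
    if force:
        return raw
    done = {p.name for p in (folder / "extracted").glob("*.json")}
    return [name for name in raw if name not in done]
```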
For M3 (build_graph.py), run:
python scripts/build_graph.py --input ./output/<slug>-mcp
What to expect:
- Reads the extracted/*.json files
- Writes output/<slug>-mcp/graph.json

Healthy output looks like:
Pages: 46
Chunks: 357
Concepts: 200+
Total edges: 1000+
MENTIONS 600+
REQUIRES 100+
HAS_CHUNK 357
LINKS_TO 80+
RELATED 40+
If concepts = 0 and "Skipped N dry-run files" appears, M2 hasn't been run with a real API key yet.
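The per-type edge breakdown in the sample output is a simple tally. A minimal sketch, assuming edges are available as (source, target, type) tuples; the on-disk layout of graph.json is not documented here, and the node names in the usage example are made up.

```python
from collections import Counter


def edge_summary(edges: list[tuple[str, str, str]]) -> dict:
    """Count edges in total and per relation type."""
    by_type = Counter(kind for _source, _target, kind in edges)
    return {"total": sum(by_type.values()), "by_type": dict(by_type)}
```

For example, one HAS_CHUNK edge plus two MENTIONS edges gives a total of 3. A graph with HAS_CHUNK edges but no MENTIONS edges would be consistent with the "concepts = 0" situation described above.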
For M4 (build_embeddings.py), run:
python scripts/build_embeddings.py --input ./output/<slug>-mcp
First run downloads all-MiniLM-L6-v2 (~80MB, cached after that).
Add --smoke-test to query both collections immediately after building:
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
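For intuition about what querying a vector index involves, here is a plain-Python sketch of a top-n similarity lookup. It assumes cosine similarity as the ranking metric, which is a common choice for sentence embeddings but is not confirmed for this script; the real index uses numpy arrays, and the helper names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec: list[float], index: dict[str, list[float]], n: int = 5) -> list[str]:
    """Return the keys of the n vectors most similar to query_vec."""
    scored = [(cosine(query_vec, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [key for _score, key in scored[:n]]
```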
What to expect from build_embeddings.py:
- output/<slug>-mcp/embeddings/ with 5 numpy files (no database needed)
- Two collections: chunks (semantic search) and concepts (concept lookup)

For M5 (generate_mcp_server.py), run:
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
Outputs:
- output/<slug>-mcp/server.py: the runnable MCP server
- output/<slug>-mcp/mcp_config.json: Claude Desktop config snippet

Install into Claude Desktop:
1. Open ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location)
2. Merge the contents of mcp_config.json into the "mcpServers" key
3. Restart Claude Desktop and check that the server (e.g. strudel-cc) appears in Claude's available tools

Test the server standalone:
python output/<slug>-mcp/server.py
# Should print "Loading ... knowledge graph... Ready: N nodes, M edges"
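The Claude Desktop install step amounts to merging one JSON object into another. A minimal sketch, assuming mcp_config.json carries its entries under a top-level "mcpServers" key (the exact shape of the generated snippet is not shown here); `merge_mcp_config` is a hypothetical helper.

```python
def merge_mcp_config(desktop_config: dict, snippet: dict) -> dict:
    """Return a copy of desktop_config with the snippet's servers added under "mcpServers"."""
    merged = dict(desktop_config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers.update(snippet.get("mcpServers", {}))
    merged["mcpServers"] = servers
    return merged
```

In practice the two dictionaries would be read with `json.load` and the result written back with `json.dump`; keeping a backup of claude_desktop_config.json first is a sensible precaution.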
Once installed, Claude can use these tools against the knowledge base:
| Tool | Description |
|---|---|
| search(query, n=5) | Semantic search over all content chunks |
| get_concept(name) | Concept details + chunks where it appears |
| get_related(concept, n=5) | Related concepts via graph edges |
| get_learning_path(start, goal) | Shortest concept path between topics |
| get_prerequisites(concept) | What must be understood first |
| get_examples(concept) | Code examples for a concept |
| list_concepts(tag?, limit=20) | Browse all indexed concepts |
| get_page(url) | All chunks for a specific doc page |
For a fresh install, provide the user with all commands in order:
# 0. Install system deps (once)
pip install requests beautifulsoup4 lxml playwright anthropic \
networkx numpy sentence-transformers mcp
playwright install chromium
# 1. Crawl
python scripts/crawl.py --url <URL> --max-depth 3 --output ./output
# 2. Extract concepts (dry-run first)
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
# Then real run:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp --model haiku
# 3. Build graph
python scripts/build_graph.py --input ./output/<slug>-mcp
# 4. Build embeddings
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
# 5. Generate server
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
# 6. Test server
python output/<slug>-mcp/server.py
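The five pipeline commands can also be assembled programmatically. The sketch below only builds the argument lists, using the script names and flags shown above; `pipeline_commands` is a hypothetical helper, and nothing is executed until the caller runs the commands.

```python
import sys


def pipeline_commands(url: str, folder: str, depth: int = 3, model: str = "haiku") -> list[list[str]]:
    """Build the argv list for each pipeline step, in the order M1 to M5."""
    py = sys.executable
    return [
        [py, "scripts/crawl.py", "--url", url, "--max-depth", str(depth), "--output", "./output"],
        [py, "scripts/extract_concepts.py", "--input", folder, "--model", model],
        [py, "scripts/build_graph.py", "--input", folder],
        [py, "scripts/build_embeddings.py", "--input", folder, "--smoke-test"],
        [py, "scripts/generate_mcp_server.py", "--input", folder],
    ]
```

Running each entry with `subprocess.run(cmd, check=True)` stops the pipeline at the first failing step. ANTHROPIC_API_KEY still has to be set in the environment before the extraction step.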
output/<slug>-mcp/
├── crawl.json State tracking (incremental re-runs)
├── raw_content/ One JSON per crawled page (HTML + links)
├── extracted/ One JSON per page (chunks + LLM concepts)
├── graph.json networkx knowledge graph
├── embeddings/ numpy indexes (chunks.npy, concepts.npy + JSON)
├── server.py The runnable MCP server ← share this
└── mcp_config.json Claude Desktop config snippet ← install this
The entire output/<slug>-mcp/ folder is the deliverable. The user can move it anywhere
as long as server.py, graph.json, and embeddings/ stay together.
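Before moving or sharing the folder, a quick check that the three required pieces are still together can help. A minimal sketch; `missing_artifacts` is a hypothetical helper:

```python
from pathlib import Path

# The pieces that must stay side by side, per the note above.
REQUIRED = ("server.py", "graph.json", "embeddings")


def missing_artifacts(folder: Path) -> list[str]:
    """Return the required files or directories that are absent from folder."""
    return [name for name in REQUIRED if not (folder / name).exists()]
```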
"No module named X" → The script auto-installs deps, but if that fails:
pip install <package> --break-system-packages

Crawl gets 0 pages → Check robots.txt and try --force to bypass the crawl cache.

extract_concepts produces a tiny concept count → The page content may be JS-only.
Check the fetched_with field in raw_content/*.json: pages fetched via requests with
very little text should have been picked up by Playwright. Re-crawl with --force.

Server fails to start → Run python output/<slug>-mcp/server.py directly and check
stderr for import errors. Most common cause: the mcp package is not installed.

Claude Desktop doesn't show the server → Verify that the path in mcp_config.json is
absolute and that the file exists. Restart Claude Desktop after any config change.
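The fetched_with check for JS-only pages can be automated roughly as follows. Only the fetched_with field name comes from the text above; the "url" and "html" keys, the "requests" value, and the 200-character threshold are assumptions for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes; crude, since script and style contents count as text too."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def text_length(html: str) -> int:
    """Length of the text in an HTML string after collapsing whitespace."""
    collector = _TextCollector()
    collector.feed(html)
    return len(" ".join("".join(collector.parts).split()))


def thin_requests_pages(pages: list[dict], min_chars: int = 200) -> list[str]:
    """URLs of pages fetched via requests whose text is shorter than min_chars."""
    return [
        page["url"]
        for page in pages
        if page.get("fetched_with") == "requests"
        and text_length(page.get("html", "")) < min_chars
    ]
```

Pages returned by this check are candidates for a re-crawl with --force.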
See TODO.md for planned improvements including: