GraphRAG Builder Skill
Turns any documentation website into a runnable MCP knowledge server in 5 pipeline steps,
each run on the user's local machine using scripts in the scripts/ folder.
Quick Reference
| Step | Script | What it does |
|---|---|---|
| M1 | crawl.py | BFS crawl → raw HTML + metadata per page |
| M2 | extract_concepts.py | HTML → chunks → LLM concept extraction |
| M3 | build_graph.py | Concepts + links → networkx knowledge graph |
| M4 | build_embeddings.py | Chunks + concepts → numpy vector index |
| M5 | generate_mcp_server.py | Graph + embeddings → standalone server.py |
All scripts require Python 3.10+ and auto-install their own dependencies on first run.
Step 0: Clarify Requirements
Before running anything, ask the user:
- URL: Which site to crawl (required — starting page)
- Depth: How many link-hops to follow (default 3; suggest 2 for large sites)
- Model: Which Claude model for concept extraction: haiku (fast/cheap) or sonnet (higher quality). Default: haiku
Derive the output slug from the URL; the pipeline writes everything to output/<slug>-mcp/ (e.g., https://strudel.cc → slug strudel-cc, output directory strudel-cc-mcp/).
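A minimal sketch of the slug derivation (the scripts' exact normalization may differ):

```python
from urllib.parse import urlparse

def slug_from_url(url: str) -> str:
    """Turn a docs URL into an output slug, e.g. https://strudel.cc -> strudel-cc."""
    host = urlparse(url).netloc.lower()
    return host.replace(".", "-")

print(slug_from_url("https://strudel.cc") + "-mcp")  # strudel-cc-mcp
```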
Step 1: Crawl (M1)
Provide this command for the user to run locally:
python scripts/crawl.py \
--url <URL> \
--max-depth <DEPTH> \
--output ./output
What to expect:
- Creates output/<slug>-mcp/raw_content/*.json (one per page)
- Creates output/<slug>-mcp/crawl.json (state tracking)
- Prints a summary: pages crawled, JS fallbacks used, failures
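To sanity-check the crawl before moving on, you can tally how pages were fetched. This sketch assumes each raw_content JSON records a fetched_with field, as referenced under Troubleshooting:

```python
import json
from collections import Counter
from pathlib import Path

raw_dir = Path("output/strudel-cc-mcp/raw_content")  # adjust the slug
methods = Counter(
    json.loads(p.read_text()).get("fetched_with", "unknown")
    for p in raw_dir.glob("*.json")
)
print(f"{sum(methods.values())} pages crawled, fetch methods: {dict(methods)}")
```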
Common issues:
- JS-heavy single-page apps → many Playwright fallbacks (normal, just slower)
- Rate limiting → add --rate-limit 1.5 to slow down
- First run needs: pip install playwright && playwright install chromium
Step 2: Extract Concepts (M2)
The user must set ANTHROPIC_API_KEY first. Provide this command:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp \
--model haiku
Dry-run first (no API cost):
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
This validates chunking quality before spending API budget. Show the user the chunk counts and section names from the dry-run output.
What to expect:
- Processes ~2–5 pages/minute on haiku
- Creates output/<slug>-mcp/extracted/*.json (one per page)
- Each file contains chunks with: concepts, tags, code examples, prerequisites, relationships
- Skips already-extracted pages (safe to re-run after interruption)
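To gauge extraction quality beyond the script's own summary, a quick tally over the extracted files can help. The chunks and concepts key names below are assumptions based on the fields listed above:

```python
import json
from pathlib import Path

extracted = Path("output/strudel-cc-mcp/extracted")  # adjust the slug
pages = chunks = concepts = 0
for p in extracted.glob("*.json"):
    data = json.loads(p.read_text())
    pages += 1
    for chunk in data.get("chunks", []):
        chunks += 1
        concepts += len(chunk.get("concepts", []))

print(f"{pages} pages, {chunks} chunks, {concepts} concept mentions")
```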
Common issues:
- Pages showing no_chunks → likely JS-rendered content not captured; acceptable for a minority of pages
- API rate limiting → script retries automatically with exponential backoff
- Use the --max-pages 10 flag to test on a small sample first
Re-running after a partial run:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --model haiku
# (automatically skips already-extracted pages)
Force re-extraction of everything:
python scripts/extract_concepts.py --input ./output/<slug>-mcp --force
Step 3: Build Graph (M3)
python scripts/build_graph.py --input ./output/<slug>-mcp
What to expect:
- Reads all non-dry-run extracted/*.json files
- Deduplicates concept names (case-insensitive, strips trailing ())
- Creates output/<slug>-mcp/graph.json
- Prints node/edge counts by type
Healthy output looks like:
Pages: 46
Chunks: 357
Concepts: 200+
Total edges: 1000+
MENTIONS 600+
REQUIRES 100+
HAS_CHUNK 357
LINKS_TO 80+
RELATED 40+
If concepts = 0 and "Skipped N dry-run files" appears, M2 hasn't been run with a real API key yet.
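To inspect the graph directly, you can reload it with networkx. This sketch assumes graph.json uses the standard networkx node-link JSON format and that nodes and edges carry a type attribute; adjust if the script's layout differs:

```python
import json
from collections import Counter
import networkx as nx

with open("output/strudel-cc-mcp/graph.json") as f:  # adjust the slug
    G = nx.node_link_graph(json.load(f))

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
print("node types:", Counter(d.get("type") for _, d in G.nodes(data=True)))
print("edge types:", Counter(d.get("type") for _, _, d in G.edges(data=True)))
```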
Step 4: Build Embeddings (M4)
python scripts/build_embeddings.py --input ./output/<slug>-mcp
First run downloads all-MiniLM-L6-v2 (~80MB, cached after that).
Add --smoke-test to query both collections immediately after building:
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
What to expect:
- Creates output/<slug>-mcp/embeddings/ with 5 numpy files (no database needed)
- Two indexes: chunks (semantic search) and concepts (concept lookup)
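For a feel of what the chunk index supports, here is a rough cosine-similarity search over the saved vectors. The chunks.npy name comes from the layout section below; the chunks.json metadata sidecar and its structure are assumptions:

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

base = "output/strudel-cc-mcp/embeddings"       # adjust the slug
vectors = np.load(f"{base}/chunks.npy")         # chunk vectors
meta = json.load(open(f"{base}/chunks.json"))   # assumed sidecar, one entry per vector

model = SentenceTransformer("all-MiniLM-L6-v2")
query = model.encode("how do I layer two patterns", normalize_embeddings=True)

# Cosine similarity reduces to a dot product once both sides are L2-normalized.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
for i in np.argsort(unit @ query)[::-1][:5]:
    print(round(float(unit[i] @ query), 3), meta[i])
```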
Step 5: Generate MCP Server (M5)
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
Outputs:
- output/<slug>-mcp/server.py — the runnable MCP server
- output/<slug>-mcp/mcp_config.json — Claude Desktop config snippet
Install into Claude Desktop:
- Open ~/Library/Application Support/Claude/claude_desktop_config.json
- Merge the contents of mcp_config.json into the "mcpServers" key (a merge sketch follows this list)
- Restart Claude Desktop
- The server name (e.g., strudel-cc) appears in Claude's available tools
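If you prefer to script the merge, a minimal sketch is below. The exact shape of mcp_config.json (whether it wraps its entry in an "mcpServers" key) is an assumption, so check the generated file first; the macOS path matches the step above:

```python
import json
from pathlib import Path

desktop_cfg = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
snippet = json.loads(Path("output/strudel-cc-mcp/mcp_config.json").read_text())  # adjust the slug

# Merge the generated server entry into the existing config without clobbering other servers.
config = json.loads(desktop_cfg.read_text()) if desktop_cfg.exists() else {}
config.setdefault("mcpServers", {}).update(snippet.get("mcpServers", snippet))
desktop_cfg.write_text(json.dumps(config, indent=2))
print("Installed servers:", ", ".join(config["mcpServers"]))
```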
Test the server standalone:
python output/<slug>-mcp/server.py
# Should print "Loading ... knowledge graph... Ready: N nodes, M edges"
The 8 MCP Tools
Once installed, Claude can use these tools against the knowledge base:
| Tool | Description |
|---|---|
| search(query, n=5) | Semantic search over all content chunks |
| get_concept(name) | Concept details + chunks where it appears |
| get_related(concept, n=5) | Related concepts via graph edges |
| get_learning_path(start, goal) | Shortest concept path between topics |
| get_prerequisites(concept) | What must be understood first |
| get_examples(concept) | Code examples for a concept |
| list_concepts(tag?, limit=20) | Browse all indexed concepts |
| get_page(url) | All chunks for a specific doc page |
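As a mental model (not the generated server's actual code), the graph-walking tools reduce to simple networkx queries over graph.json; the "type" edge attribute is assumed:

```python
import networkx as nx

def learning_path(G: nx.Graph, start: str, goal: str) -> list[str]:
    """Shortest concept-to-concept hop sequence, ignoring edge direction."""
    return nx.shortest_path(G.to_undirected(), start, goal)

def prerequisites(G: nx.DiGraph, concept: str) -> list[str]:
    """Concepts reached over REQUIRES edges."""
    return [v for _, v, d in G.out_edges(concept, data=True) if d.get("type") == "REQUIRES"]
```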
Complete Pipeline Command Sequence
For a fresh install, provide the user with all commands in order:
# 0. Install system deps (once)
pip install requests beautifulsoup4 lxml playwright anthropic \
networkx numpy sentence-transformers mcp
playwright install chromium
# 1. Crawl
python scripts/crawl.py --url <URL> --max-depth 3 --output ./output
# 2. Extract concepts (dry-run first)
python scripts/extract_concepts.py --input ./output/<slug>-mcp --dry-run
# Then real run:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/<slug>-mcp --model haiku
# 3. Build graph
python scripts/build_graph.py --input ./output/<slug>-mcp
# 4. Build embeddings
python scripts/build_embeddings.py --input ./output/<slug>-mcp --smoke-test
# 5. Generate server
python scripts/generate_mcp_server.py --input ./output/<slug>-mcp
# 6. Test server
python output/<slug>-mcp/server.py
Output Directory Layout
output/<slug>-mcp/
├── crawl.json State tracking (incremental re-runs)
├── raw_content/ One JSON per crawled page (HTML + links)
├── extracted/ One JSON per page (chunks + LLM concepts)
├── graph.json networkx knowledge graph
├── embeddings/ numpy indexes (chunks.npy, concepts.npy + JSON)
├── server.py The runnable MCP server ← share this
└── mcp_config.json Claude Desktop config snippet ← install this
The entire output/<slug>-mcp/ folder is the deliverable. The user can move it anywhere
as long as server.py, graph.json, and embeddings/ stay together.
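That constraint follows from how a generated server typically resolves its data files relative to its own location; the pattern below is an assumption about the generated code, not a quote from it:

```python
from pathlib import Path

BASE = Path(__file__).resolve().parent   # the directory server.py lives in
GRAPH_PATH = BASE / "graph.json"         # must stay next to server.py
EMBEDDINGS_DIR = BASE / "embeddings"     # likewise
```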
Troubleshooting
"No module named X" → The script auto-installs deps, but if it fails:
pip install <package> --break-system-packages
Crawl gets 0 pages → Check robots.txt and try --force to bypass the crawl cache.
extract_concepts produces a tiny concept count → The page content may be JS-only.
Check the fetched_with field in raw_content/*.json — pages fetched via requests that contain
very little text should have been handled by Playwright instead. Re-crawl with --force.
Server fails to start → Run python output/<slug>-mcp/server.py directly and check
stderr for import errors. Most common cause: mcp package not installed.
Claude Desktop doesn't show the server → Verify the path in mcp_config.json is
absolute and the file exists. Restart Claude Desktop after any config change.
Deferred Features
See TODO.md for planned improvements including:
- YouTube transcript fetching
- Neo4j export for large graphs
- OpenAI/Voyage embedding API support
- Scheduled re-crawls
- Graph visualization