LLM Knowledge Bases

Representation-first multimodal Markdown wiki runtime for Obsidian vaults, with standalone CLI, MCP server, and OpenClaw compatibility.

Install

openclaw plugins install clawhub:llm-knowledge-bases-plugin

Inspired by a public workflow shared by Andrej Karpathy (@karpathy), this project turns raw text, PDFs, images, and structured data into a living Markdown wiki that compounds with every question.

@harrylabs/llm-knowledge-bases is the deterministic runtime behind that workflow. It ships as:

a standalone CLI for directly running the kb_* workflow
a stdio MCP server for Claude Code, Codex, Cursor, Gemini CLI, and other MCP-capable agents
a config generator for wiring that MCP server into different clients
an OpenClaw-compatible host entry for teams that also use OpenClaw

If you want the workflow-first entry point, start with the companion skill. Use this package when you want the underlying runtime as an installable CLI/MCP toolchain.

What 0.4.1 Implements

This release makes the runtime representation-first and explicitly multimodal:

a raw/wiki/schema operating model with runtime-owned structure and agent-owned synthesis
supported raw kinds for text (.md, .txt), PDFs, images (.png, .jpg, .jpeg, .webp, .gif, .svg), and structured data (.csv, .tsv, .json, .html)
manifest schema version 2, including raw_kind, mime_type, size_bytes, asset_refs, and stored representations
source-id repair through kb_repair_source_ids, which fixes stale source doc ids, source note paths, and raw hashes without discarding existing readable ids
stable non-ASCII source ids plus deterministic repair workflows, so legacy src-untitled-* records are migrated forward instead of being preserved by stale manifest state
safe raw-asset inspection through kb_get_raw_asset, including deterministic metadata plus a safe absolute path for local viewers
full compile context through kb_prepare_source_bundle, including asset refs, stored representations, and compile_readiness
runtime-managed representation storage under .llm-kb/representations/ through kb_prepare_representation, kb_upsert_representation, and kb_read_representations
compile-readiness tracking with ready, partial, and needs_representation
source note validation that keeps raw_kind, mime_type, and asset_paths aligned with the actual reviewed assets
archived output notes plus first-class concept, entity, and synthesis note support
deterministic gap mapping and promotion through kb_map_gaps and kb_promote_gap
generated wiki/index.md, wiki/log.md, and collection indexes, now with raw-kind labels on source pages
deterministic lint for schema and wiki health, including warnings for missing representation trails, stale representations, inconsistent asset_paths, isolated pages, stale source coverage, unsupported claims, contradiction candidates, and missing high-value pages
CLI and MCP wrappers around the same runtime contract
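
As a rough illustration of the supported raw kinds and the manifest fields listed above, the following Python sketch maps file extensions to a raw kind and fills in a record with the named fields. The extension groups and field names come from the list; the kind labels ("text", "pdf", "image", "structured"), the field types, and the ManifestEntry, classify_raw_kind, and describe_raw names are assumptions made for this example, not the runtime's actual schema.

```python
import mimetypes
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

# Extension groups follow the 0.4.1 list; the kind labels are assumed names.
RAW_KIND_BY_EXT = {
    ".md": "text", ".txt": "text",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".webp": "image", ".gif": "image", ".svg": "image",
    ".csv": "structured", ".tsv": "structured",
    ".json": "structured", ".html": "structured",
}


@dataclass
class ManifestEntry:
    """Hypothetical shape of one manifest record; the field types are guesses."""
    raw_path: str
    raw_kind: str
    mime_type: Optional[str]
    size_bytes: int
    asset_refs: List[str] = field(default_factory=list)
    representations: List[str] = field(default_factory=list)


def classify_raw_kind(raw_path: str) -> Optional[str]:
    """Return the assumed raw-kind label, or None for unsupported extensions."""
    return RAW_KIND_BY_EXT.get(Path(raw_path).suffix.lower())


def describe_raw(raw_path: str, size_bytes: int) -> ManifestEntry:
    """Build an illustrative entry without touching the filesystem."""
    kind = classify_raw_kind(raw_path)
    if kind is None:
        raise ValueError(f"unsupported raw file: {raw_path}")
    mime_type, _ = mimetypes.guess_type(raw_path)
    return ManifestEntry(raw_path, kind, mime_type, size_bytes)


entry = describe_raw("raw/papers/report.pdf", size_bytes=1024)
print(entry.raw_kind, entry.mime_type)
```

The real manifest is written by the runtime, so a sketch like this is only useful for reasoning about what a record might carry, not for producing manifest.json by hand.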

Multimodal Ingest Model

The runtime now supports two ingest paths:

Text and structured data can still be compiled directly from raw/ with kb_prepare_source and kb_read_raw.
PDFs and images use a representation-first path:
- inspect the asset with kb_get_raw_asset
- inspect compile readiness with kb_prepare_source_bundle
- store intermediate OCR, vision, page notes, metadata, or profiles under .llm-kb/representations/
- compile the final source note only after the representation trail is present

The runtime intentionally does not perform OCR or vision itself. Instead, it gives agents a canonical place to store those intermediate artifacts and then validates that the final wiki pages stay grounded in them.
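
The two ingest paths can be sketched as a small planner that picks a tool sequence from the file extension. The tool names are the ones documented here; the exact ordering, the use of kb_upsert_source_note as the final step, the plan_ingest and may_compile helpers, and the choice to treat "partial" as not yet compilable are assumptions for illustration, not confirmed runtime behavior.

```python
from pathlib import Path
from typing import List

# Extensions that compile directly versus those that need representations first.
DIRECT_EXTS = {".md", ".txt", ".csv", ".tsv", ".json", ".html"}
REPRESENTATION_FIRST_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".gif", ".svg"}

READINESS_STATES = {"ready", "partial", "needs_representation"}


def plan_ingest(raw_path: str) -> List[str]:
    """Return an assumed tool order for ingesting one raw file."""
    ext = Path(raw_path).suffix.lower()
    if ext in DIRECT_EXTS:
        return ["kb_prepare_source", "kb_read_raw", "kb_upsert_source_note"]
    if ext in REPRESENTATION_FIRST_EXTS:
        return [
            "kb_get_raw_asset",           # inspect the asset
            "kb_prepare_source_bundle",   # inspect compile readiness
            "kb_prepare_representation",  # prepare a slot for OCR/vision output
            "kb_upsert_representation",   # store the intermediate artifact
            "kb_upsert_source_note",      # compile the final source note
        ]
    raise ValueError(f"unsupported raw file: {raw_path}")


def may_compile(compile_readiness: str) -> bool:
    """Conservative gate (an assumption): only 'ready' unlocks the source note."""
    if compile_readiness not in READINESS_STATES:
        raise ValueError(f"unknown readiness state: {compile_readiness}")
    return compile_readiness == "ready"


print(plan_ingest("raw/papers/report.pdf"))
print(may_compile("needs_representation"))
```

The OCR or vision work itself happens between the prepare and upsert steps, outside the runtime, which is why the planner only lists where its results are stored.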

Default Vault Shape

<vault>/
  raw/
  wiki/
    sources/
    outputs/
    concepts/
    entities/
    syntheses/
    _indexes/
    index.md
    log.md
  .llm-kb/
    manifest.json
    runs.jsonl
    representations/
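
To check whether a directory already matches this layout, a small Python helper along these lines could report which expected paths are absent. It is a read-only sketch: the path lists mirror the tree above, the missing_vault_paths name is hypothetical, and the helper does not create or modify anything, since the runtime owns these files.

```python
import tempfile
from pathlib import Path
from typing import List

# Directories and files from the default vault shape shown above.
VAULT_DIRS = [
    "raw",
    "wiki",
    "wiki/sources",
    "wiki/outputs",
    "wiki/concepts",
    "wiki/entities",
    "wiki/syntheses",
    "wiki/_indexes",
    ".llm-kb",
    ".llm-kb/representations",
]
VAULT_FILES = [
    "wiki/index.md",
    "wiki/log.md",
    ".llm-kb/manifest.json",
    ".llm-kb/runs.jsonl",
]


def missing_vault_paths(vault_root: str) -> List[str]:
    """List expected directories and files that do not exist under vault_root."""
    root = Path(vault_root)
    missing = [d for d in VAULT_DIRS if not (root / d).is_dir()]
    missing += [f for f in VAULT_FILES if not (root / f).is_file()]
    return missing


# Demo against a throwaway directory that only contains raw/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "raw").mkdir()
    print(missing_vault_paths(tmp))
```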

CLI Commands

The standalone CLI exposes the runtime surface directly:

llm-knowledge-bases kb_status --vault-root /vault
llm-knowledge-bases kb_list_raw --vault-root /vault --changed-only
llm-knowledge-bases kb_read_raw --vault-root /vault --raw-path raw/notes/example.md
llm-knowledge-bases kb_get_raw_asset --vault-root /vault --raw-path raw/papers/report.pdf
llm-knowledge-bases kb_prepare_source --vault-root /vault --raw-path raw/notes/example.md
llm-knowledge-bases kb_prepare_source_bundle --vault-root /vault --raw-path raw/papers/report.pdf
llm-knowledge-bases kb_prepare_representation --vault-root /vault --raw-path raw/papers/report.pdf --kind ocr_text
llm-knowledge-bases kb_upsert_representation --vault-root /vault --raw-path raw/papers/report.pdf --kind ocr_text --content '<markdown>'
llm-knowledge-bases kb_read_representations --vault-root /vault --raw-path raw/papers/report.pdf --kinds metadata,ocr_text
llm-knowledge-bases kb_upsert_source_note --vault-root /vault --raw-path raw/papers/report.pdf --markdown '<full markdown>'
llm-knowledge-bases kb_prepare_output --vault-root /vault --title 'Example Query' --query 'What are the tradeoffs?'
llm-knowledge-bases kb_upsert_output --vault-root /vault --markdown '<full markdown>'
llm-knowledge-bases kb_prepare_derived_note --vault-root /vault --kind concept --title 'Agent Memory'
llm-knowledge-bases kb_upsert_derived_note --vault-root /vault --markdown '<full markdown>'
llm-knowledge-bases kb_map_gaps --vault-root /vault --limit 10
llm-knowledge-bases kb_promote_gap --vault-root /vault --note-id synthesis-retrieval-vs-memory
llm-knowledge-bases kb_repair_source_ids --vault-root /vault
llm-knowledge-bases kb_repair_source_ids --vault-root /vault --apply
llm-knowledge-bases kb_rebuild_indexes --vault-root /vault
llm-knowledge-bases kb_search --vault-root /vault --query 'agent memory' --types source,concept,synthesis
llm-knowledge-bases kb_read_notes --vault-root /vault --paths wiki/index.md,wiki/concepts/concept-agent-memory.md
llm-knowledge-bases kb_lint --vault-root /vault
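
When scripting these commands, a thin wrapper can build the argument list from keyword options. The sketch below assumes the flag convention visible in the examples above (snake_case option names map to --kebab-case flags, and boolean options such as --apply or --changed-only are bare flags). The kb_argv and run_kb helpers are hypothetical, and because the CLI's output format is not described here, run_kb returns stdout as plain text.

```python
import subprocess
from typing import List, Union

CLI = "llm-knowledge-bases"


def kb_argv(tool: str, vault_root: str, **options: Union[str, int, bool, None]) -> List[str]:
    """Build an argv list such as [CLI, 'kb_lint', '--vault-root', '/vault']."""
    argv = [CLI, tool, "--vault-root", vault_root]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)                # bare flag, e.g. --apply
        elif value is not False and value is not None:
            argv.extend([flag, str(value)])  # valued flag, e.g. --limit 10
    return argv


def run_kb(tool: str, vault_root: str, **options: Union[str, int, bool, None]) -> str:
    """Run the CLI (it must be installed and on PATH) and return its stdout."""
    result = subprocess.run(
        kb_argv(tool, vault_root, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(kb_argv("kb_list_raw", "/vault", changed_only=True))
print(kb_argv("kb_map_gaps", "/vault", limit=10))
```

Passing an argv list rather than a shell string avoids quoting problems with values such as full Markdown bodies.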

MCP Tools

The MCP server exposes:

kb_status
kb_list_raw
kb_read_raw
kb_get_raw_asset
kb_prepare_source
kb_prepare_source_bundle
kb_prepare_representation
kb_upsert_representation
kb_read_representations
kb_upsert_source_note
kb_prepare_output
kb_upsert_output
kb_prepare_derived_note
kb_upsert_derived_note
kb_map_gaps
kb_promote_gap
kb_repair_source_ids
kb_rebuild_indexes
kb_search
kb_read_notes
kb_lint
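
For orientation, an MCP client invokes one of these tools with a JSON-RPC "tools/call" request, sent as a single line of JSON over the stdio transport. The Python sketch below only builds such a message. The envelope follows the MCP specification, but the argument key vault_root is an assumption, because the tool input schemas are not listed here, and a real session also begins with an initialize handshake that this example omits.

```python
import json
from typing import Any, Dict

# Tool names exposed by the MCP server, as listed above.
KB_TOOLS = {
    "kb_status", "kb_list_raw", "kb_read_raw", "kb_get_raw_asset",
    "kb_prepare_source", "kb_prepare_source_bundle",
    "kb_prepare_representation", "kb_upsert_representation",
    "kb_read_representations", "kb_upsert_source_note",
    "kb_prepare_output", "kb_upsert_output",
    "kb_prepare_derived_note", "kb_upsert_derived_note",
    "kb_map_gaps", "kb_promote_gap", "kb_repair_source_ids",
    "kb_rebuild_indexes", "kb_search", "kb_read_notes", "kb_lint",
}


def tools_call_message(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request as one newline-terminated line."""
    if tool not in KB_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


# "vault_root" is an assumed argument name, used here for illustration only.
print(tools_call_message(1, "kb_status", {"vault_root": "/vault"}), end="")
```

In practice an MCP-capable agent generates these requests itself, so this is mainly useful for debugging the server by hand.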

Runtime Philosophy

The runtime owns:

canonical paths
canonical IDs
validation
deterministic writes
manifest-backed representation tracking
generated wiki navigation

The agent owns:

summarization
OCR, vision, or profiling work performed outside the runtime
synthesis
deciding whether a result belongs in output, concept, entity, or synthesis
improving the wiki over time instead of leaving value trapped in chat

kb_prepare_source_bundle is the bridge between those layers for non-text assets: it returns the raw metadata, reviewed asset refs, stored representations, and readiness state the agent needs before compiling a source note. kb_map_gaps and kb_promote_gap continue to cover durable knowledge growth on top of that ingest layer. kb_lint remains deterministic and now also warns when a multimodal source note lacks a representation trail or relies on stale representations, so problems surface before the wiki starts depending on that note.
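
One way to picture a staleness check on a representation is to compare the hash of the current raw file with the hash recorded when the representation was stored. The sketch below is only an illustration of that idea: SHA-256, the raw_hash and representation_is_stale helpers, and the notion of a recorded hash per representation are all assumptions, since the runtime's actual hashing and lint rules are not specified here.

```python
import hashlib


def raw_hash(data: bytes) -> str:
    """Hash raw file contents; SHA-256 is an assumed choice of algorithm."""
    return hashlib.sha256(data).hexdigest()


def representation_is_stale(current_raw: bytes, recorded_raw_hash: str) -> bool:
    """A representation is treated as stale once the raw bytes no longer match."""
    return raw_hash(current_raw) != recorded_raw_hash


recorded = raw_hash(b"original scan")
print(representation_is_stale(b"original scan", recorded))
print(representation_is_stale(b"re-scanned file", recorded))
```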

Still Out of Scope

This package still does not implement:

embeddings or vector search
database-backed indexing
rename tracking
built-in OCR, vision, or PDF parsing inside the runtime itself
autonomous background agents inside the package