Markitdown-Skill-for-non-multimodal-agent

Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx/xls), EPUB, HTML, CSV, JSON, XML, or ZIP — and you need its CONTENT to read, summarize, quote, or store it. The model can't open the file itself, so convert it to Markdown first with the markitdown MCP server (local, free, no API key). An OPTIONAL OCR layer reads scanned PDFs and images via an OpenAI-compatible vision model, but ONLY when a key is configured. Skip for files the agent can already read as plain text, and for plain images when no OCR key is set.

Install

openclaw skills install markitdown-skill-for-non-multimodal-agent

markitdown — let a text-only agent read documents

A non-multimodal OpenClaw agent has no eyes: its backend is a plain-text API, so it cannot open a PDF / Word / Excel / PowerPoint attachment at all. This skill turns those files into Markdown the model can read.

Two layers, and you almost always need only the first:

  • Free layer — DEFAULT. The markitdown MCP server converts text-bearing documents (PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to Markdown locally. No API key. No per-call cost. This handles the vast majority of attachments, because most documents store real text.
  • OCR layer — OPT-IN. Scanned PDFs (photographed pages), standalone image files, and images embedded inside documents contain no extractable text — the only way to read them is to have a vision model look. This layer is OFF unless OPENAI_API_KEY is set, and it bills per image.

The cost line is simple: pulling existing text out of a file is free; asking a model to look at a picture costs money.

When to use

A document attachment arrives (Slack file, email attachment, a path the user gives you) whose extension or MIME type is one of:

pdf · docx · pptx · xlsx · xls · epub · html · htm · csv · json · xml · zip · md

…and you need its content (to answer about it, summarize it, quote it, or save it).

For plain images (png · jpg · jpeg · gif · webp · tiff): only useful if the OCR layer is on. With no key, a text-only agent simply cannot read an image — say so rather than guessing.

Do not use this for a file the agent can already read as plain text in the prompt.
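
The routing rule above can be sketched as a small lookup. This is a minimal illustration, not part of the skill: the function name route_attachment and the four return labels are hypothetical, and the extension sets are simply copied from the lists in this section.

```python
from pathlib import Path

# Extension sets copied from the lists in this section.
DOCUMENT_EXTS = {"pdf", "docx", "pptx", "xlsx", "xls", "epub",
                 "html", "htm", "csv", "json", "xml", "zip", "md"}
IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "webp", "tiff"}


def route_attachment(path: str, ocr_key_set: bool) -> str:
    """Return "convert", "ocr", "unreadable", or "skip" (hypothetical labels)."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "convert"  # the free layer handles text-bearing documents
    if ext in IMAGE_EXTS:
        # Plain images need the OCR layer; with no key, say so instead of guessing.
        return "ocr" if ocr_key_set else "unreadable"
    return "skip"  # not a format this skill targets
```

Matching on the lowercased suffix keeps names like Report.PDF from slipping through; a real runtime may prefer the MIME type when it is available.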


Setup (operator, one time)

Free layer — the MCP server

Run the server over stdio with uvx; no separate install step is needed:

uvx markitdown-mcp

Register it in your OpenClaw / MCP client config:

{
  "mcpServers": {
    "markitdown": {
      "command": "uvx",
      "args": ["markitdown-mcp"]
    }
  }
}

This exposes one tool: convert_to_markdown(uri), where uri is any http:, https:, file:, or data: URI. That is the whole free layer.
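
Building the file: URI by string concatenation breaks on paths with spaces or other reserved characters. A minimal sketch of a safer way, using only the Python standard library (the helper name to_file_uri is hypothetical, and the example assumes a POSIX path):

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Turn an absolute filesystem path into a file: URI."""
    p = Path(path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    # as_uri() percent-encodes spaces and other reserved characters.
    return p.as_uri()


print(to_file_uri("/tmp/Q3 report.pdf"))  # file:///tmp/Q3%20report.pdf
```

The resulting string is what you would pass as the uri argument of convert_to_markdown.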

The MCP server runs with the privileges of its process and can read any file that the process's user can read. Keep it to local stdio use only; do not expose it over the network.

OCR layer — the CLI (optional)

The MCP server cannot OCR — it never wires up a vision client, so even with plugins enabled it silently returns text-only output. OCR runs through the CLI instead. Install the fork (which ships the markitdown-ocr plugin) plus an OpenAI client:

pip install "markitdown[all] @ git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown"
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr"
pip install openai

Then export a key (any OpenAI-compatible endpoint works):

export OPENAI_API_KEY=sk-...
export MARKITDOWN_OCR_MODEL=gpt-4o-mini   # a low-cost vision model; optional

With no OPENAI_API_KEY, the plugin still loads but OCR is skipped — you fall back to the free converter automatically. So the OCR layer is genuinely zero-cost until someone opts in.


Flow (per attachment)

  1. Get the absolute path. The downloaded attachment's absolute path is already provided by the runtime (e.g. MediaPaths). Build a file://<absolute-path> URI.

    • ⚠️ Convert in the same turn the file arrives: downloads live in a temp dir and may be garbage-collected by the next turn.

  2. Free convert (always try first). Call convert_to_markdown("file://<abspath>") on the markitdown MCP server. For normal documents you are done: read or store the Markdown.

  3. Decide if OCR is needed. OCR only matters when:

    • the file is a standalone image, or
    • the free conversion came back empty / whitespace-only / a few stray characters (a tell-tale of a scanned PDF — pages are images, not text).

    If neither is true, stop. Don't spend a vision call on a document that already gave you text.

  4. OCR (only if needed AND OPENAI_API_KEY is set). Shell out to the CLI:

    markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<out>.md"
    

    Or, with no global install, one-shot via uvx:

    uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
        --with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
        --with openai \
        markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}"
    

    OCR-extracted text is wrapped inline as *[Image OCR] … [End OCR]*, interleaved in reading order, so document structure is preserved.

    If OPENAI_API_KEY is NOT set and the content is image-only: do not pretend. Tell the user the file is image-based and reading it needs the optional OCR layer (an OpenAI-compatible key), and stop.

  5. Use or persist. One-off question → read the Markdown and answer; no need to save. Worth keeping → write it to your knowledge store with provenance (original filename, source, date).
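
The decision in steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the skill's own code: the function names, the action labels, and the min_chars threshold for "a few stray characters" are all hypothetical. The CLI flags are copied from the command in step 4.

```python
import os


def needs_ocr(markdown: str, is_image: bool, min_chars: int = 20) -> bool:
    """Step 3: OCR only for standalone images or near-empty free conversions."""
    if is_image:
        return True
    # Count non-whitespace characters; min_chars is an assumed threshold.
    visible = sum(1 for ch in markdown if not ch.isspace())
    return visible < min_chars


def ocr_command(abspath: str, out_path: str) -> list[str]:
    """Step 4: argv for the CLI call, mirroring ${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}."""
    model = os.environ.get("MARKITDOWN_OCR_MODEL") or "gpt-4o-mini"
    return ["markitdown", abspath, "--use-plugins",
            "--llm-client", "openai", "--llm-model", model, "-o", out_path]


def plan(markdown: str, is_image: bool, abspath: str, out_path: str):
    """Return (action, argv); argv is None unless OCR should run."""
    if not needs_ocr(markdown, is_image):
        return ("use_markdown", None)   # free conversion was enough
    if not os.environ.get("OPENAI_API_KEY"):
        return ("tell_user", None)      # image-only content and no key: stop
    return ("run_ocr", ocr_command(abspath, out_path))
```

Returning an argv list rather than a shell string lets the caller run it with subprocess.run(argv) without quoting problems for paths that contain spaces.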


Cost & model notes

  • Free layer: $0. Local text extraction, no network model.
  • OCR layer: one vision API call per image (and one per page for fully scanned PDFs, rendered at 300 DPI). With gpt-4o-mini this is roughly a fraction of a cent per image — cheap, but not zero, and it scales with image count. Pick a small vision model unless you need fidelity.
  • The OCR layer is the reason this fork exists: it gives a text-only agent a way to "see" images, on demand, without making the whole agent multimodal.
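
To make the scaling concrete, here is a rough estimator based on the call-count rule above. The function names are hypothetical, and the per-call price is a placeholder you would take from your provider's pricing; no real price is assumed here.

```python
def vision_calls(standalone_images: int, embedded_images: int, scanned_pages: int) -> int:
    """One call per image, plus one per page of a fully scanned PDF."""
    return standalone_images + embedded_images + scanned_pages


def estimated_cost_usd(calls: int, price_per_call_usd: float) -> float:
    """price_per_call_usd is a placeholder supplied by the caller."""
    return calls * price_per_call_usd


# A 40-page scanned PDF plus a deck with 3 embedded images and 1 loose screenshot:
print(vision_calls(standalone_images=1, embedded_images=3, scanned_pages=40))  # 44
```

The point of the arithmetic is that a single long scanned PDF can dominate the bill, which is why the flow tries the free conversion first.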

Gotchas

  • MCP ≠ OCR. Do not set MARKITDOWN_ENABLE_PLUGINS=true on the server expecting OCR — the server passes no llm_client, so it silently skips OCR. OCR is CLI-only.
  • Path access. Both the file:// input and any output path must be inside the server/agent's allowed root, or the call is blocked.
  • Encrypted / corrupt files can fail conversion. Report the failure plainly; for PDFs you can retry with a dedicated PDF tool if available.
  • Don't OCR what already has text. Step 3's check exists to avoid burning vision calls on ordinary documents.
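
For the path-access gotcha, a pre-check like the one below can catch a blocked path before the call is made. It is only a sketch: the actual enforcement lives in the server or agent runtime, and both the helper name inside_root and the allowed-root value are assumptions.

```python
from pathlib import Path


def inside_root(path: str, allowed_root: str) -> bool:
    """True if path, after resolving symlinks and '..', lies under allowed_root."""
    resolved = Path(path).resolve()
    root = Path(allowed_root).resolve()
    return resolved == root or root in resolved.parents
```

Resolving both sides first means a path such as <root>/../etc/passwd is rejected even though it starts with the root as a string prefix.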

Supported formats

Free (local): PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, EPUB, ZIP (iterates contents), plus text formats.

OCR-enhanced (key required): scanned PDFs, standalone images, and images embedded in PDF/DOCX/PPTX/XLSX.


Built on microsoft/markitdown; OCR layer from the Self-made-Orange/markitdown fork (packages/markitdown-ocr).