Install
openclaw skills install markitdown-skill-for-non-multimodal-agent
Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document in one of these formats: PDF, Word (docx), PowerPoint (pptx), Excel (xlsx/xls), EPUB, HTML, CSV, JSON, XML, or ZIP. Use it when you need the document's CONTENT to read, summarize, quote, or store it. The model can't open the file itself, so convert it to Markdown first with the markitdown MCP server (local, free, no API key). An OPTIONAL OCR layer reads scanned PDFs and images via an OpenAI-compatible vision model, but ONLY when a key is configured. Skip for files the agent can already read as plain text, and for plain images when no OCR key is set.
A non-multimodal OpenClaw agent has no eyes: its backend is a plain-text API, so it cannot open a PDF / Word / Excel / PowerPoint attachment at all. This skill turns those files into Markdown the model can read.
Two layers, and you almost always only need the first:
markitdown MCP server converts text-bearing documents (PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to Markdown locally. No API key. No per-call cost. This handles the vast majority of attachments, because most documents store real text.
Optional OCR layer reads scanned PDFs and images through an OpenAI-compatible vision model. It runs only when OPENAI_API_KEY is set, and it bills per image.
The cost line is simple: pulling existing text out of a file is free; asking a model to look at a picture costs money.
A document attachment arrives (Slack file, email attachment, a path the user gives you) whose extension or MIME is one of:
pdf · docx · pptx · xlsx · xls · epub · html · htm · csv · json · xml · zip · md
…and you need its content (to answer about it, summarize it, quote it, or save it).
For plain images (png · jpg · jpeg · gif · webp · tiff): only useful if the OCR layer is on. With no key, a text-only agent simply cannot read an image — say so rather than guessing.
Do not use this for a file the agent can already read as plain text in the prompt.
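The trigger rules above can be sketched as a small routing function. This is a minimal sketch, not part of the skill: the function name `route_attachment` and its return labels are hypothetical, and only the extension lists come from the text above.

```python
from pathlib import PurePath

# Extensions the text lists as document attachments for the free layer.
DOCUMENT_EXTS = {
    "pdf", "docx", "pptx", "xlsx", "xls", "epub",
    "html", "htm", "csv", "json", "xml", "zip", "md",
}
# Plain images: only readable when the optional OCR layer is on.
IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "webp", "tiff"}


def route_attachment(filename: str, ocr_enabled: bool) -> str:
    """Classify an attachment by extension (hypothetical helper).

    Returns one of:
      "convert"    - send to the free markitdown converter
      "ocr"        - plain image, OCR layer available
      "unreadable" - plain image, no OCR key: tell the user, do not guess
      "skip"       - not covered by this skill
    """
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "convert"
    if ext in IMAGE_EXTS:
        return "ocr" if ocr_enabled else "unreadable"
    return "skip"
```

Extension matching is only a first filter; a runtime that exposes the MIME type could check that instead.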
Run the server over stdio with no install using uvx:
uvx markitdown-mcp
Register it in your OpenClaw / MCP client config:
{
  "mcpServers": {
    "markitdown": {
      "command": "uvx",
      "args": ["markitdown-mcp"]
    }
  }
}
This exposes one tool: convert_to_markdown(uri), where uri is any http:, https:, file:, or data: URI. That is the whole free layer.
The MCP server runs with the privileges of its process and can read any file that user can read. Keep it bound to local/stdio use only.
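If the client config already lists other MCP servers, the markitdown entry can be merged in rather than pasted over the file. The sketch below works on an already-parsed config dict; the helper name `register_markitdown` is hypothetical, and where the config file lives depends on your client, so loading and saving are left out.

```python
from typing import Any, Dict

# The server entry shown in the config above.
MARKITDOWN_ENTRY = {"command": "uvx", "args": ["markitdown-mcp"]}


def register_markitdown(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with the markitdown server registered.

    Other entries under "mcpServers" are kept, and the input dict
    is not modified.
    """
    servers = dict(config.get("mcpServers", {}))
    servers["markitdown"] = dict(MARKITDOWN_ENTRY)
    return {**config, "mcpServers": servers}
```

Reading the file with `json.load`, passing the result through this function, and writing it back with `json.dump` would complete the round trip.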
The MCP server cannot OCR — it never wires up a vision client, so even with plugins enabled it silently returns text-only output. OCR runs through the CLI instead. Install the fork (which ships the markitdown-ocr plugin) plus an OpenAI client:
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]"
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr"
pip install openai
Then export a key (any OpenAI-compatible endpoint works):
export OPENAI_API_KEY=sk-...
export MARKITDOWN_OCR_MODEL=gpt-4o-mini # cheapest vision model; optional
With no OPENAI_API_KEY, the plugin still loads but OCR is skipped — you fall back to the free converter automatically. So the OCR layer is genuinely zero-cost until someone opts in.
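An agent-side wrapper can apply the same opt-in rule before it ever shells out. This is a minimal sketch: the function name `ocr_model_from_env` is hypothetical, and only the two environment variable names and the gpt-4o-mini default come from the text above.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_MODEL = "gpt-4o-mini"


def ocr_model_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the vision model to use, or None when OCR is not opted in.

    OCR counts as enabled only when OPENAI_API_KEY is set and non-empty.
    An unset or empty MARKITDOWN_OCR_MODEL falls back to the default,
    matching the shell form ${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}.
    """
    if not env.get("OPENAI_API_KEY"):
        return None
    return env.get("MARKITDOWN_OCR_MODEL") or DEFAULT_OCR_MODEL
```

Passing the environment in as a mapping keeps the check easy to test without touching the real process environment.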
Get the absolute path. The downloaded attachment's absolute path is already provided by the runtime (e.g. MediaPaths). Build a file://<absolute-path> URI.
Free convert (always try first). Call convert_to_markdown("file://<abspath>") on the markitdown MCP server. For normal documents you are done — read or store the Markdown.
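Decide if OCR is needed. OCR only matters when the free conversion returned little or no text (the file is a scanned, image-only PDF), or when the content you need is inside images (a standalone image, or images embedded in a PDF/DOCX/PPTX/XLSX).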
If neither is true, stop. Don't spend a vision call on a document that already gave you text.
OCR (only if needed AND OPENAI_API_KEY is set). Shell out to the CLI:
markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<out>.md"
Or, with no global install, one-shot via uvx:
uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
--with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
--with openai \
markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}"
OCR-extracted text is wrapped inline as *[Image OCR] … [End OCR]*, interleaved in reading order, so document structure is preserved.
If OPENAI_API_KEY is NOT set and the content is image-only: do not pretend. Tell the user the file is image-based and reading it needs the optional OCR layer (an OpenAI-compatible key), and stop.
Use or persist. One-off question → read the Markdown and answer; no need to save. Worth keeping → write it to your knowledge store with provenance (original filename, source, date).
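The workflow above can be sketched as a few small helpers. This is an illustration under stated assumptions, not the skill's implementation: the helper names are hypothetical, and the 40-character threshold for "the free pass produced no real text" is an arbitrary placeholder you would tune. The CLI flags are copied from the command shown in the OCR step.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical threshold: below this many characters, treat the free
# conversion as having found no real text (likely a scanned document).
MIN_TEXT_CHARS = 40


def file_uri(abspath: str) -> str:
    """Build the file:// URI for convert_to_markdown.

    Path.as_uri() raises ValueError for relative paths, which enforces
    the "absolute path" requirement of the first step.
    """
    return Path(abspath).as_uri()


def looks_image_only(markdown: str) -> bool:
    """Heuristic (an assumption): almost no text came back."""
    return len(markdown.strip()) < MIN_TEXT_CHARS


def next_step(markdown: str, ocr_model: Optional[str]) -> str:
    """Decide what to do after the free conversion.

    ocr_model is None when OPENAI_API_KEY is not configured.
    """
    if not looks_image_only(markdown):
        return "use_markdown"          # done, no vision call needed
    if ocr_model is None:
        return "tell_user_ocr_needed"  # do not pretend to have read it
    return "run_ocr"


def ocr_command(abspath: str, out_path: str, model: str) -> List[str]:
    """Argument list for the OCR CLI call, ready for subprocess.run."""
    return [
        "markitdown", abspath,
        "--use-plugins",
        "--llm-client", "openai",
        "--llm-model", model,
        "-o", out_path,
    ]
```

Building the command as an argument list rather than a shell string avoids quoting problems with attachment paths that contain spaces.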
With gpt-4o-mini, OCR costs roughly a fraction of a cent per image: cheap, but not zero, and it scales with image count. Pick a small vision model unless you need fidelity.
Do not set MARKITDOWN_ENABLE_PLUGINS=true on the server expecting OCR: the server passes no llm_client, so it silently skips OCR. OCR is CLI-only.
Any file:// input and any output path must be inside the server/agent's allowed root, or the call is blocked.
Free (local): PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, EPUB, ZIP (iterates contents), plus text formats.
OCR-enhanced (key required): scanned PDFs, standalone images, and images embedded in PDF/DOCX/PPTX/XLSX.
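The allowed-root rule can be checked on the agent side before a call is made, so a blocked path fails early with a clear message. The source does not show how the server or runtime enforces this, so the sketch below is only one plausible pre-check, with a hypothetical helper name.

```python
from pathlib import Path


def is_within_root(path: str, allowed_root: str) -> bool:
    """Return True if path resolves to allowed_root or something under it.

    resolve() collapses ".." segments and follows symlinks, so a path
    like <root>/../etc/passwd is correctly rejected.
    """
    resolved = Path(path).resolve()
    root = Path(allowed_root).resolve()
    return resolved == root or root in resolved.parents
```

Run the same check on the output path before writing converted Markdown to disk.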
Built on microsoft/markitdown; OCR layer from the Self-made-Orange/markitdown fork (packages/markitdown-ocr).