Install
openclaw skills install @vst93/vision-fallback-skill

Vision/image understanding for agents whose model can't read images (returns "model does not support images", empty/unknown output, low confidence, or user-reported failure). Calls the Volcengine Ark (doubao) vision API and returns structured JSON. Use whenever an image must be understood. Do NOT substitute with local OCR (tesseract): OCR extracts text only, not layout or visual understanding.

Calls the Volcengine Ark (doubao) vision API, NOT OpenRouter. Only credential needed: ARK_API_KEY.

Use when ANY holds:
- The primary model cannot read the image (it reports model does not support images or images are not supported, or refuses to read the attached image)

Otherwise do NOT use this skill.

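
As a rough illustration of the trigger check, the sketch below tests a model reply for the refusal phrases quoted above, and treats an empty reply as a failure too. The function name `needs_vision_skill` is hypothetical and not part of the skill. Conditions that cannot be detected from the reply text alone are not covered.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the skill): returns 0 when the
# primary model's reply suggests it could not read the image, 1 otherwise.
needs_vision_skill() {
    reply=$1

    # An empty reply is treated as a failure to read the image.
    if [ -z "$reply" ]; then
        return 0
    fi

    # Lower-case the reply so the phrase match is case-insensitive.
    lowered=$(printf '%s' "$reply" | tr '[:upper:]' '[:lower:]')

    case $lowered in
        *"does not support images"*|*"images are not supported"*)
            return 0 ;;
    esac
    return 1
}
```

A caller might write `needs_vision_skill "$PRIMARY_OUTPUT" && ./scripts/check.sh`, so the preflight only runs when the skill is actually needed.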
Do NOT fall back to local OCR (tesseract, ocrmypdf, …) as a substitute.
OCR extracts text only — it cannot infer layout, control types (switch / radio /
card), or visual hierarchy. If the skill cannot run (see Preflight), stop and
tell the user the missing prerequisite (usually ARK_API_KEY) instead of
silently degrading to OCR.
./scripts/check.sh
Exits 0 only when all prerequisites are present (shell deps + ARK_API_KEY
resolved + endpoint reachable). If it fails, read its stderr, fix the reported
prerequisite, and re-run. Do not proceed to call-api.sh until check.sh
passes — a failed preflight means the API call will fail anyway.
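
One way to enforce this gate in a wrapper script is sketched below. The function name `run_preflight` is an assumption made for illustration. It relies only on the behaviour described above: the check script exits 0 on success and writes its diagnosis to stderr.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with the skill).
# $1 is the path to the preflight script, e.g. ./scripts/check.sh.
run_preflight() {
    check_script=$1

    # "2>&1 >/dev/null" captures stderr only and discards stdout,
    # so $err holds the reported missing prerequisite.
    if err=$("$check_script" 2>&1 >/dev/null); then
        return 0
    fi

    printf 'Preflight failed; fix this prerequisite and re-run:\n%s\n' "$err" >&2
    return 1
}

# Usage: run_preflight ./scripts/check.sh || exit 1
```

Returning non-zero instead of continuing keeps the caller from reaching call-api.sh with a missing key or an unreachable endpoint.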
Inputs: image (required; a file path, URL, or data URL); ocr_text, failure_reason, and primary_model_output (all three optional).
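
To make the image forms concrete, here is a minimal sketch of turning a local file into a data URL, with existing data URLs passed through unchanged. The function `to_data_url` is hypothetical and is not the skill's own conversion code. It guesses the MIME type from the file extension and does not fetch remote URLs.

```shell
#!/bin/sh
# Hypothetical helper (not the skill's implementation): print a data URL
# for a local image file, or echo the argument if it already is one.
to_data_url() {
    image=$1

    case $image in
        data:*) printf '%s\n' "$image"; return 0 ;;  # already a data URL
    esac

    if [ ! -f "$image" ]; then
        printf 'not a local file: %s\n' "$image" >&2
        return 1
    fi

    # Assumption: the file extension reflects the real image type.
    case $image in
        *.png|*.PNG)               mime=image/png ;;
        *.jpg|*.jpeg|*.JPG|*.JPEG) mime=image/jpeg ;;
        *.webp)                    mime=image/webp ;;
        *.gif)                     mime=image/gif ;;
        *) printf 'unsupported image extension: %s\n' "$image" >&2
           return 1 ;;
    esac

    # base64 may wrap long output; strip newlines so the URL is one token.
    printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$image" | tr -d '\n')"
}
```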

1. Run ./scripts/check.sh. If non-zero, stop and report to the user (see above); do not fall back to OCR.
2. Run ./scripts/call-api.sh "$IMAGE" "$OCR_TEXT" "$FAILURE_REASON" "$PRIMARY_OUTPUT". The script resolves ARK_API_KEY itself, converts the image to a data URL, assembles the payload, and POSTs. See references/configuration.md for key-resolution order.
3. Read choices[0].message.content, which holds the structured JSON. Schema in references/output-format.md.

API endpoint/body/model note: references/api-reference.md.

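
The last step can be sketched with jq, which is assumed to be installed. The sketch also assumes the response arrives on stdin as a chat-completions style JSON object whose content field is a JSON string. Neither the function name `extract_content` nor the use of jq comes from the skill itself.

```shell
#!/bin/sh
# Hypothetical helper (assumes jq is available): read an API response on
# stdin and print the structured JSON held in choices[0].message.content.
extract_content() {
    # fromjson fails if the content is missing or is not valid JSON;
    # -e additionally makes jq exit non-zero on a null or false result.
    # -c prints the result on a single line.
    jq -ce '.choices[0].message.content | fromjson'
}
```

If call-api.sh prints the raw response on stdout (an assumption; check references/api-reference.md), the steps chain as `./scripts/call-api.sh "$IMAGE" | extract_content`. A non-zero exit then signals a response that did not match the expected shape.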
The most common real-world trigger is a primary model that does not support images at all. In that case this skill is not a fallback; it is the vision layer. Use it directly whenever the user provides an image that must be understood.