Install
openclaw skills install @wunianze666-netizen/visible-text-extractor
Extract and reconstruct as much visible text as possible from webpage URLs, article pages, screenshots, long images, image directories, and GIFs. Use when the goal is not just raw OCR, but a clean, human-readable result with section grouping, OCR cleanup, deduplication, structured JSON, original reading-order reconstruction, and explicit uncertainty notes. Especially useful for WeChat articles, event posters, long screenshots, mixed text-plus-image pages, and cases where visible information must be preserved without dumping noisy OCR into the final answer.
Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.
ffmpeg is available
scripts/extract_visible_text.py
scripts/postprocess_ocr_text.py: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
scripts/extract_with_browser.js: browser-rendered fallback for JS-heavy pages
scripts/extract_gif_frames.sh: GIF frame extraction via ffmpeg
scripts/build_deliverable_docx.js: convert cleaned markdown into a Word document
scripts/build_transcript_docx.js: convert transcript-style markdown into a Word document
scripts/build_authorized_capture_docx.py: one-step pipeline for already-authorized browser pages, saved HTML, screenshots, and mixed inputs into clean markdown + JSON + Word deliverable
scripts/extract_visible_text_deliverable.py: one-step pipeline from source input to clean markdown + JSON + Word deliverable
scripts/extract_visible_text_transcript_deliverable.py: one-step pipeline for transcript-style full extraction output
scripts/extract_visible_text_reading_order_deliverable.py: one-step pipeline for reading-order transcript output
scripts/build_wechat_interleaved_docx.py: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
scripts/ocr_high_accuracy.py: higher-accuracy OCR with preprocessing variants and segmented long-image handling
references/output-schema.md: target output structure and cleanup rules
references/deliverable-workflow.md: one-step deliverable workflow guidance
references/troubleshooting.md: failure patterns, environment limits, and how to respond cleanly
references/product-positioning.md: what mature deliverable quality means for this skill
references/generalization-plan.md: how to evolve the skill across travel deals, rule pages, event posters, and tutorial long images
references/universal-article-extractor-spec.md: generalized capability contract for article, mixed-media, and screenshot-heavy extraction
When raw OCR is noisy, do not stop at extraction.
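To make the cleanup idea concrete, here is a minimal sketch of the kind of pass that "merge broken spacing" and "remove obvious garbage" suggest. It is not the code in scripts/postprocess_ocr_text.py; the function name clean_ocr_lines and both heuristics (joining CJK characters split by spaces, dropping lines with fewer than two letters, digits, or CJK characters) are assumptions chosen for illustration.

```python
import re

# Hypothetical heuristics for illustration; the bundled script may work differently.
_CJK = r"\u4e00-\u9fff"


def clean_ocr_lines(lines):
    """Return cleaned OCR lines with spacing repaired and symbol-only noise dropped."""
    cleaned = []
    for line in lines:
        # Trim the line and collapse any run of whitespace into one space.
        text = re.sub(r"\s+", " ", line.strip())
        # Remove the single spaces OCR often inserts between adjacent CJK characters.
        text = re.sub(rf"(?<=[{_CJK}]) (?=[{_CJK}])", "", text)
        # Keep the line only if it carries at least two letters, digits, or CJK characters.
        if len(re.findall(rf"[A-Za-z0-9{_CJK}]", text)) >= 2:
            cleaned.append(text)
    return cleaned
```

A caller could run this over raw OCR lines before regrouping them into sections; thresholds like "two meaningful characters" would need tuning per source.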
For mp.weixin.qq.com URLs:
Use scripts/build_wechat_interleaved_docx.py when the task is specifically “keep original article order” for WeChat posts.
Report blocked: true clearly instead of pretending success.
Extract URL to markdown:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://example.com/post' \
--format markdown \
--output result.md
Extract URL to JSON:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://example.com/post' \
--format json \
--output result.json
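Once result.json exists, a caller may want to check for a reported block before trusting the content. The sketch below assumes the JSON is an object with a top-level blocked flag, as named in this document, and a text field; the text key and the usable_text helper are hypothetical, and references/output-schema.md is the place to confirm the real structure.

```python
import json


def usable_text(result_path):
    """Return the extracted text, or None when the run reported a block."""
    with open(result_path, encoding="utf-8") as fh:
        result = json.load(fh)
    # "blocked" is the flag named in this document; "text" is an assumed key.
    if result.get("blocked"):
        return None
    return result.get("text", "")
```

Returning None rather than an empty string keeps "blocked" distinguishable from "extracted, but the page had no text".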
Extract WeChat article with fallbacks:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://mp.weixin.qq.com/s/xxxx' \
--browser-fallback \
--page-screenshot-ocr \
--format markdown \
--output wechat.md
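The flags above imply an ordered fallback: a direct fetch first, then a browser render, then OCR of a full-page screenshot as the last resort. A minimal sketch of that control flow follows. The extractor callables, the min_chars threshold, and the shape of the returned dictionary are all assumptions, not the behaviour of extract_visible_text.py.

```python
def extract_with_fallbacks(url, extractors, min_chars=200):
    """Try each (name, extractor) pair in order and return the first adequate result.

    `extractors` is a list of (name, callable) pairs; each callable takes the URL
    and returns text. `min_chars` is an assumed cut-off for "page looks complete".
    """
    attempts = []
    for name, extractor in extractors:
        try:
            text = extractor(url) or ""
        except Exception as exc:  # a failed method should not abort the chain
            attempts.append((name, f"error: {exc}"))
            continue
        attempts.append((name, f"{len(text)} chars"))
        if len(text) >= min_chars:
            return {"blocked": False, "method": name, "text": text, "attempts": attempts}
    # Every method failed or came back too short: say so instead of pretending success.
    return {"blocked": True, "method": None, "text": "", "attempts": attempts}
```

Recording every attempt makes it possible to explain in the final answer which method produced the text and which ones were tried first.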
Extract local screenshot or long image:
python3 {baseDir}/scripts/extract_visible_text.py \
--image ./screenshot.png \
--ocr-images \
--format markdown \
--output image-result.md
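Long images are usually OCRed in slices, since very tall inputs tend to hurt recognition. The sketch below computes overlapping vertical slices so that a text line cut by one boundary appears whole in the neighbouring slice. The function name and the default sizes are assumptions and are not taken from scripts/ocr_high_accuracy.py.

```python
def segment_bounds(height, segment_height=2000, overlap=200):
    """Return (top, bottom) pixel ranges that cover an image of the given height."""
    if height <= 0:
        raise ValueError("height must be positive")
    if segment_height <= overlap:
        raise ValueError("segment_height must exceed overlap")
    bounds = []
    top = 0
    while True:
        bottom = min(top + segment_height, height)
        bounds.append((top, bottom))
        if bottom >= height:
            break
        # Start the next slice slightly above the cut so boundary lines are not lost.
        top = bottom - overlap
    return bounds
```

Because adjacent slices overlap, the same line can be recognized twice, which is one reason a deduplication pass is useful after merging the per-slice text.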
Run OCR post-processing:
python3 {baseDir}/scripts/postprocess_ocr_text.py \
--input-json ./ocr-result.json \
--title 'Clean Result' \
--body-text 'Optional summary or body text' \
--output-json ./clean.json \
--output-markdown ./clean.md
Run the one-step deliverable pipeline:
python3 {baseDir}/scripts/extract_visible_text_deliverable.py \
--url 'https://mp.weixin.qq.com/s/xxxx' \
--browser-fallback \
--page-screenshot-ocr \
--ocr-images \
--dedupe \
--output-prefix ./deliverable/result
This should emit:
result.raw.json
result.clean.json
result.clean.md
result.docx
Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:
python3 {baseDir}/scripts/build_authorized_capture_docx.py \
--url 'https://example.com/page' \
--browser-capture \
--ocr-images \
--dedupe \
--output-prefix ./deliverable/captured
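After a pipeline run it can help to confirm that every deliverable was actually written. The sketch below checks a given --output-prefix against the four suffixes listed under "This should emit". Whether build_authorized_capture_docx.py produces exactly the same set is an assumption, and missing_outputs is a hypothetical helper.

```python
from pathlib import Path

# Suffixes taken from the "This should emit" list for the deliverable pipeline.
EXPECTED_SUFFIXES = (".raw.json", ".clean.json", ".clean.md", ".docx")


def missing_outputs(output_prefix):
    """Return the expected deliverable paths that do not exist for this prefix."""
    prefix = str(output_prefix)
    return [prefix + suffix for suffix in EXPECTED_SUFFIXES
            if not Path(prefix + suffix).exists()]
```

For example, missing_outputs("./deliverable/result") returns an empty list only when all four files are present.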
Useful cases:
Operational expectations for this pipeline:
Practical optimization rule:
--url: webpage URL
--text-file: local plain text / markdown input
--html-file: local saved HTML page
--image PATH: add one local image or GIF; repeat as needed
--image-dir DIR: OCR all supported images / GIFs in a directory
--format markdown|json: output format
--output PATH: output file path
--ocr-images: OCR discovered or provided images
--dedupe: deduplicate repeated merged lines
--browser-fallback: use browser-rendered fallback for incomplete pages
--page-screenshot-ocr: OCR the browser full-page screenshot as a last resort
--gif-mode none|placeholder: conservative GIF handling mode
Default target: produce something a human can read comfortably and share without cleanup.
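As an illustration of what --dedupe describes, the sketch below drops repeated lines after body text and OCR text have been merged, keeping the first occurrence and the original order. Comparing lines after whitespace and case normalization is an assumption; the script may compare lines differently.

```python
import re


def dedupe_lines(lines):
    """Drop repeated lines, keeping first occurrences and the original order."""
    seen = set()
    kept = []
    for line in lines:
        # Normalize whitespace and case so near-identical OCR repeats compare equal.
        key = re.sub(r"\s+", " ", line).strip().casefold()
        if not key:
            kept.append(line)  # keep blank lines as paragraph breaks
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept
```

Exact-match deduplication of this kind is deliberately conservative: it will not merge lines that differ by an OCR misread, which avoids silently deleting information that is genuinely distinct.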
Release-quality target for article deliverables:
The skill should increasingly treat extraction as a full article understanding and recovery problem, not only a body scrape plus OCR problem:
When the user explicitly wants completeness, the skill must support a fuller extraction mode:
For clean article outputs, prefer a structure like:
For transcript outputs, prefer a structure like:
Mature-skill rule:
Read these references when needed:
references/output-schema.md
references/deliverable-workflow.md
references/troubleshooting.md
references/product-positioning.md
references/generalization-plan.md
references/universal-article-extractor-spec.md
ocr-local skill or compatible Tesseract.js setup.
playwright-core support.
ffmpeg.