HTML to HTML

Clean and restructure HTML documents using MinerU. Takes messy or complex HTML and produces clean, well-formatted HTML output with proper structure preserved.

Features:
HTML cleanup and restructuring
Removes unnecessary markup and noise
Preserves core content structure
Produces clean HTML from cluttered web pages

Use when you need to: clean up messy HTML, restructure an HTML document, convert complex HTML to clean HTML, or sanitize HTML content.

Use when asked: 'how do I clean this HTML', 'make this HTML cleaner', 'I want clean HTML from this page', 'can my agent clean up HTML', 'is there a skill for HTML cleanup', 'restructure this messy HTML'.

Built on MinerU by OpenDataLab (Shanghai AI Lab), an open-source document intelligence engine. Great for web developers, content migration teams, and anyone who needs to clean up HTML from legacy systems, CMS exports, or messy web scraping results.

Install

openclaw skills install @mzlzyca/html-to-html

HTML to HTML

Fetch a remote web page or local HTML file and convert it to clean structured HTML using MinerU. Strips noise and preserves semantic content.

Install

npm install -g mineru-open-api
# or via Go (macOS/Linux):
go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Quick Start

# Crawl a web page and output clean HTML (requires token)
mineru-open-api crawl https://example.com/article -f html -o ./out/

# Re-extract a local HTML file to clean HTML (requires token)
mineru-open-api extract page.html -f html -o ./out/

# Batch crawl multiple URLs to HTML (requires token)
mineru-open-api crawl url1 url2 -f html -o ./pages/

Authentication

Token required:

mineru-open-api auth             # Interactive token setup
export MINERU_TOKEN="your-token" # Or via environment variable

Create token at: https://mineru.net/apiManage/token
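
A script that calls the CLI non-interactively may want to stop early when no token has been exported. The sketch below assumes the token is supplied through the MINERU_TOKEN environment variable described above. The helper name require_mineru_token is hypothetical and not part of the CLI, and it cannot see a token stored by the interactive mineru-open-api auth setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of mineru-open-api): fail fast when
# MINERU_TOKEN is not exported. It only inspects the environment variable;
# a token saved by `mineru-open-api auth` is not detected by this check.
require_mineru_token() {
  if [ -z "${MINERU_TOKEN:-}" ]; then
    echo "MINERU_TOKEN is not set; create a token at https://mineru.net/apiManage/token" >&2
    return 1
  fi
}

# Usage:
#   require_mineru_token && mineru-open-api crawl https://example.com/article -f html -o ./out/
```

Chaining the check with && means the crawl is attempted only after the variable has been found.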

Capabilities

Input: remote web page URL or local .html file
Output: clean structured HTML (-f html)
For remote URLs: use crawl -f html
For local HTML files: use extract -f html
Requires a token; HTML output is not available in flash-extract
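
The split between crawl for remote URLs and extract for local files can be wrapped in a small dispatcher. This is a minimal sketch, not something shipped with the CLI: the function name to_clean_html and the default output directory ./out/ are assumptions, and it uses only the subcommands and flags listed above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: choose the subcommand from the shape of the input.
# Inputs starting with http:// or https:// go to `crawl`; anything else is
# treated as a local file and goes to `extract`. Both request HTML output.
to_clean_html() {
  local input="$1" outdir="${2:-./out/}"
  case "$input" in
    http://*|https://*)
      mineru-open-api crawl "$input" -f html -o "$outdir" ;;
    *)
      mineru-open-api extract "$input" -f html -o "$outdir" ;;
  esac
}

# Usage:
#   to_clean_html https://example.com/article ./out/
#   to_clean_html page.html ./out/
```

The wrapper does not check that a local path exists or ends in .html; any such validation is left to the CLI.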

Notes

HTML output (-f html) requires token; not available in flash-extract
crawl supports output formats: md, html, json
extract supports output formats: md, html, latex, docx, json
Output goes to stdout by default; use -o <dir> to save to a file or directory
All progress/status messages go to stderr; document content goes to stdout
MinerU is open-source by OpenDataLab (Shanghai AI Lab): https://github.com/opendatalab/MinerU
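
Because document content goes to stdout and progress messages go to stderr, the two streams can be redirected separately. The sketch below assumes that omitting -o sends the cleaned HTML to stdout, as the notes above state. The function name extract_to_file and the file names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the cleaned document to one file and the
# progress/status messages to another, relying on the stdout/stderr split.
# No -o flag is passed, so the document is expected on stdout.
extract_to_file() {
  local input="$1" html_out="$2" log_out="$3"
  mineru-open-api extract "$input" -f html > "$html_out" 2> "$log_out"
}

# Usage:
#   extract_to_file page.html clean.html extract.log
```

Keeping the log in its own file means the HTML file holds only document content and can be passed directly to another tool.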
