Install
openclaw skills install html-text-extract
Extract main content text from an HTML page (URL, file, or stdin). Strips nav, footer, ads, and boilerplate. Pipes cleanly into readability_check or any text-analysis tool.
Extract clean main content text from HTML pages, stripping navigation, footers, ads, sidebars, and other boilerplate. Uses trafilatura for content extraction, the same library most academic web-scraping pipelines use.
Use this skill when the user:
Run html_extract.py with one of:
python3 html_extract.py https://example.com/page
python3 html_extract.py page.html
cat page.html | python3 html_extract.py -
Pipe the output into a downstream tool. The canonical pairing is the readability checker:
python3 html_extract.py https://example.com/article \
| python3 /path/to/readability_check.py -
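The three invocation forms above (URL, file path, and `-` for stdin) imply a small dispatch step before extraction. The following is a minimal sketch of how that step might look, not the skill's actual source. The helper names `classify_source`, `load_html`, and `extract_text` are hypothetical. The only trafilatura calls used are `fetch_url()` and `extract()`, imported lazily so that file and stdin input work even without network access.

```python
import sys
from urllib.parse import urlparse


def classify_source(arg: str) -> str:
    """Return 'stdin', 'url', or 'file' for a command-line argument (hypothetical helper)."""
    if arg == "-":
        return "stdin"
    if urlparse(arg).scheme in ("http", "https"):
        return "url"
    return "file"


def load_html(arg: str) -> str:
    """Read raw HTML from whichever source the argument names."""
    kind = classify_source(arg)
    if kind == "stdin":
        return sys.stdin.read()
    if kind == "url":
        # Lazy import: only URL input needs trafilatura's downloader.
        import trafilatura

        html = trafilatura.fetch_url(arg)  # returns None when the download fails
        if html is None:
            raise RuntimeError(f"could not fetch {arg}")
        return html
    # Anything else is treated as a local file path.
    with open(arg, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def extract_text(html: str) -> str:
    """Run trafilatura's main-content extraction; empty string if nothing is found."""
    import trafilatura

    return trafilatura.extract(html) or ""
```

Under these assumptions, `extract_text(load_html("page.html"))` would correspond to the second invocation form, and `load_html("-")` to the piped one.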
Output format options:
--format txt (default): plain text, ideal for readability/sentiment tools
--format markdown: preserves headings and lists, ideal for LLM ingestion
--format json: text plus extracted metadata (title, author, date if available)
By default, plain text on stdout. Status and error messages go to stderr so piping stays clean.
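The stdout/stderr split described above can be sketched with the standard library alone. This is an illustrative outline, not the skill's real implementation. The function names `build_parser`, `render`, and `emit` are hypothetical, as is the JSON shape (a flat object with `text` plus metadata keys). The markdown conversion itself is left to the extractor and is not reproduced here.

```python
import argparse
import json
import sys


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the documented options (sketch)."""
    parser = argparse.ArgumentParser(prog="html_extract.py")
    parser.add_argument("source", help="URL, file path, or '-' for stdin")
    parser.add_argument(
        "--format", choices=("txt", "markdown", "json"), default="txt"
    )
    return parser


def render(text: str, metadata: dict, fmt: str) -> str:
    """Serialise the extracted text; the JSON layout is an assumption."""
    if fmt == "json":
        return json.dumps({"text": text, **metadata}, ensure_ascii=False)
    # For txt and markdown the extractor's output is passed through unchanged.
    return text


def emit(payload: str, status: str, out=None, err=None) -> None:
    """Write the result to stdout and the status message to stderr."""
    print(status, file=err if err is not None else sys.stderr)
    print(payload, file=out if out is not None else sys.stdout)
```

Because status lines never reach stdout, a downstream tool reading the pipe sees only the extracted content.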
open() and URLs to trafilatura.fetch_url(), both of which sanitise.