Install
openclaw skills install clean-content-fetch
Fetches clean, readable body content from web pages. Suited to modern web pages, blogs, news, announcements, and WeChat Official Account articles. Supports body extraction, content cleanup, noise reduction, and Markdown output, for cases where an ordinary fetch gives poor results, the page is noisy, or dynamic rendering interferes.
Prefer this skill when the user wants to fetch web page content, extract the main text, convert a page to markdown/text, or capture the body of an article.
python3 scripts/scrapling_fetch.py <url> <max_chars>
- article
- main
- .post-content
- [class*="body"]
- html2text converts to Markdown
- body
- max_chars truncates the output
python3 scripts/scrapling_fetch.py <url> 30000
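The selector list and the max_chars cut-off can be read as a "first match wins, then truncate" pipeline. The sketch below illustrates that reading only. It is not the code in scripts/scrapling_fetch.py: the names SELECTOR_PRIORITY, pick_first_match, and truncate are hypothetical, and treating the listed order as a priority order with body as the last-resort fallback is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed priority order, copied from the selector list; using "body" as the
# final fallback is an assumption, not something the script confirms.
SELECTOR_PRIORITY = ("article", "main", ".post-content", '[class*="body"]', "body")


def pick_first_match(
    query: Callable[[str], Optional[str]],
    selectors: Sequence[str] = SELECTOR_PRIORITY,
) -> Tuple[Optional[str], str]:
    """Return (selector, html) for the first selector that yields non-empty HTML.

    `query` is a hypothetical callable that maps a CSS selector to the matched
    HTML string, or None when nothing matches.
    """
    for selector in selectors:
        html = query(selector)
        if html and html.strip():
            return selector, html
    return None, ""


def truncate(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars characters."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    return text[:max_chars]
```

With a real page, query would wrap whatever CSS lookup the fetch library provides, and the selected HTML would go through html2text before truncate is applied.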
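Common dependencies include:
- scrapling
- html2text
- curl_cffi
- playwright
- browserforge
Install the dependencies in an isolated environment before running the script. If the host environment restricts system-level pip installs, use a project-level virtual environment.
Example: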
python3 -m venv .venv
. .venv/bin/activate
pip install scrapling html2text curl_cffi playwright browserforge
python -m playwright install chromium
python scripts/scrapling_fetch.py <url> 30000
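Before running the script inside the virtual environment, it can help to confirm that the dependencies are importable. The check below is a minimal sketch. REQUIRED_MODULES and missing_modules are hypothetical names, and it assumes each package's import name matches the pip name listed above.

```python
import importlib.util
from typing import Iterable, List

# Assumption: the import names equal the pip package names listed above.
REQUIRED_MODULES = ("scrapling", "html2text", "curl_cffi", "playwright", "browserforge")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

This only checks that the modules can be located. It does not verify that the Chromium browser for playwright has been downloaded.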
By default the script outputs the body content as Markdown.
For structured output, append --json.
To debug which selector the extraction matched, check the stderr output.
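Because the content goes to stdout and the selector diagnostics go to stderr, a caller can capture the two streams separately. The wrapper below is a sketch under stated assumptions: build_command and run_fetch are hypothetical helpers, the script path is taken from the commands above, and nothing is assumed about the fields of the --json output.

```python
import subprocess
from typing import List, Tuple

SCRIPT = "scripts/scrapling_fetch.py"  # path as used in the commands above


def build_command(url: str, max_chars: int = 30000, as_json: bool = False) -> List[str]:
    """Build the argv list for the fetch script, appending --json on request."""
    command = ["python3", SCRIPT, url, str(max_chars)]
    if as_json:
        command.append("--json")
    return command


def run_fetch(url: str, max_chars: int = 30000, as_json: bool = False) -> Tuple[str, str]:
    """Run the script and return (stdout, stderr) as text.

    stdout carries the extracted content; stderr carries debug output such as
    the matched selector. Raises CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        build_command(url, max_chars, as_json),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout, result.stderr
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.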
- references/usage.md
- references/selectors.md
- scripts/fetch-web-content
For xhslink.com short links or Xiaohongshu note pages, run the script directly:
python3 scripts/scrapling_fetch.py 'http://xhslink.com/o/9745hugimlD' 30000
Notes: