Article Extract

v1.0.0

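Extracts the main text of WeChat official-account articles, blogs, news sites, and other web pages, bypassing anti-scraping mechanisms, and outputs plain text.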

by CodePlayer @caozeal
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The SKILL.md claims '绕过微信公众号反爬机制' (bypass WeChat anti-scraping). The included script only uses urllib with a standard User-Agent and basic HTML parsing — there is no headless browser, cookie/session handling, proxy rotation, or any other anti-bot circumvention. The capability claim is therefore overstated and inconsistent with the code.
Instruction Scope
Runtime instructions simply run the provided Python script on a given URL. The script only issues an HTTP GET, parses HTML, strips scripts/styles, and prints text. It does not read local files, access environment variables, or send data to third-party endpoints beyond the target URL.
Install Mechanism
No install spec; the skill is instruction-only with a small Python script that uses only the standard library. No downloads from external URLs or package installs are required.
Credentials
The skill declares no required environment variables, credentials, or config paths. The code does not access environment variables or request secrets.
Persistence & Privilege
The skill does not request persistent presence (always=false) and contains no code that modifies agent settings or other skills. Autonomous invocation is allowed by default but is not compounded by other privileges.
What to consider before installing
The tool itself is a small, clear Python script that fetches a URL and extracts visible text; it does not exfiltrate secrets or contact unknown endpoints. However, the README's promise to 'bypass WeChat anti-scraping' is not supported by the implementation — it only sets a User-Agent and will likely fail on sites that require JavaScript, cookies, special headers, or other anti-bot measures. Before installing or using: (1) understand that it may not work on dynamic or protected pages and might require a headless browser or official APIs; (2) review and run the script locally to confirm behavior; (3) be aware of site terms-of-service and legal/privacy considerations when scraping content. If you need reliable scraping of protected pages, prefer well-maintained tooling (e.g., Playwright/Selenium or official APIs) and avoid assuming this script 'bypasses' anti-scraping on its own.


latestvk977cdxga5ehrwwqmkz462a1m182aq9f
539 downloads
0 stars
1 version
Updated 1mo ago
v1.0.0
MIT-0

Article Extract

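A tool for extracting article content from web pages. It supports WeChat official accounts, blogs, news sites, and similar pages, and outputs clean plain text.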

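Features

  • ✅ Bypasses WeChat official-account anti-scraping mechanisms
  • ✅ Automatically filters out scripts, styles, navigation, and other irrelevant content
  • ✅ Pure Python implementation with no extra dependencies
  • ✅ Supports any web page URL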

Installation

No installation needed; run it directly with Python 3.

Usage

python3 skills/article-extract/scripts/extract.py <url>

Examples

# Extract a WeChat official-account article
python3 skills/article-extract/scripts/extract.py "https://mp.weixin.qq.com/s/xxxxx"

# Extract a blog post
python3 skills/article-extract/scripts/extract.py "https://example.com/blog/post"

# Save to a file
python3 skills/article-extract/scripts/extract.py "https://mp.weixin.qq.com/s/xxxxx" > article.txt

Output

The tool writes the extracted plain text to stdout, which you can redirect to a file:

python3 skills/article-extract/scripts/extract.py "https://..." > output.txt
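
When the script is driven from another Python program rather than a shell, the same redirect can be approximated with `subprocess`. This is only a sketch: `save_stdout` is a hypothetical helper, not part of the skill, and it assumes the child process writes UTF-8 to stdout.

```python
import subprocess


def save_stdout(command, output_path):
    """Run a command, write its stdout to output_path as UTF-8, and return the text."""
    # stdout=PIPE (rather than capture_output) keeps this compatible with Python 3.6.
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    # Assumption: the child prints UTF-8; undecodable bytes are replaced, not fatal.
    text = result.stdout.decode("utf-8", errors="replace")
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Example call (script path from the usage line above; the URL is a placeholder):
# save_stdout(["python3", "skills/article-extract/scripts/extract.py",
#              "https://example.com/blog/post"], "output.txt")
```

With `check=True`, a non-zero exit status from the script raises `subprocess.CalledProcessError` instead of silently writing an empty file.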

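How it works

  1. Sends an HTTP request with a standard browser User-Agent
  2. Parses the HTML and filters out irrelevant tags such as <script>, <style>, <nav>, and <footer>
  3. Extracts the body text and cleans up extra whitespace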
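
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual `extract.py`: the names `USER_AGENT`, `SKIP_TAGS`, `TextExtractor`, `fetch_html`, and `html_to_text`, the User-Agent string, and the 15-second timeout are all assumptions.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Assumed values: the real script's User-Agent string and skip list may differ.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
SKIP_TAGS = {"script", "style", "nav", "footer"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def fetch_html(url, timeout=15):
    """Step 1: GET the page with a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(html):
    """Steps 2 and 3: drop skipped tags, then collapse extra whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Each text chunk starts its own line; blank lines are discarded.
    lines = "\n".join(parser.chunks).splitlines()
    cleaned = (re.sub(r"\s+", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)
```

Calling `html_to_text(fetch_html(url))` chains the steps. Nothing here handles cookies, JavaScript, or redirects to verification pages, so it only works where a plain GET already returns the article HTML.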

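Limitations

  • The target page must allow access from a standard browser
  • Pages that require login or special permissions may not be extractable
  • Some dynamically loaded content (such as infinite scroll) may not be extracted completely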
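
Because these failures tend to produce little or no text rather than an error, a caller may want a rough sanity check on the output. The helper below is a hypothetical heuristic, not something the skill provides; the name `looks_incomplete` and the 200-character threshold are arbitrary assumptions.

```python
def looks_incomplete(text, min_chars=200):
    """Heuristic: very short output often means the page was blocked, gated, or script-rendered."""
    # Count only non-whitespace characters so padding does not hide an empty result.
    visible = "".join(text.split())
    return len(visible) < min_chars
```

A short result is not proof of a problem (some articles are simply brief), so this is best treated as a prompt to inspect the page by hand.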

Dependencies

  • Python 3.6+
  • No third-party libraries (standard library only)

Author

Packaged from OpenClaw community practice
