Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

xiaofei自用-WeChat Article Scraper

v2.0.0

WeChat Official Accounts article scraping tool. Scrapes public articles (text + images + video) from mp.weixin.qq.com, parses the content blocks in order, downloads the images, and writes everything to a Feishu knowledge base in the original order.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for mengzi53/xiaofei-ziyong-wechat-article-scraper.

Prompt Preview: Install & Setup
Install the skill "xiaofei自用-WeChat Article Scraper" (mengzi53/xiaofei-ziyong-wechat-article-scraper) from ClawHub.
Skill page: https://clawhub.ai/mengzi53/xiaofei-ziyong-wechat-article-scraper
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3, google-chrome
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install xiaofei-ziyong-wechat-article-scraper

ClawHub CLI


npx clawhub@latest install xiaofei-ziyong-wechat-article-scraper
Security Scan

VirusTotal: Benign
View report →

OpenClaw: Suspicious (high confidence)
Purpose & Capability
The skill's name and description claim both scraping mp.weixin.qq.com and writing the article into Feishu (飞书) knowledge base. The included Python scripts implement scraping, HTML parsing, and local image downloading, but do not contain any Feishu API calls or logic to authenticate/write to Feishu. SKILL.md references commands like feishu_create_doc / feishu_update_doc / feishu_doc_media that are not provided and are not declared as required binaries or env vars. Requiring only python3 and google-chrome is appropriate for scraping but insufficient for the claimed Feishu import capability.
Instruction Scope
Runtime instructions and scripts are narrowly scoped to fetching pages with headless Chrome, parsing HTML, extracting content blocks, and downloading images to a local cache directory; they do not read unrelated system files or environment variables. The scripts use aggressive anti-detection Chrome flags to bypass WeChat anti-scraping, which is expected for this purpose but may have legal/ToS implications. The SKILL.md asks the user to manually re-locate images in Feishu due to API limitations—consistent with the code producing Markdown and local images.
Install Mechanism
There is no install specification (instruction-only skill plus included scripts). No external downloads or archive extraction occur. The only runtime dependency declared is google-chrome and python3, which matches the code's use of subprocess to invoke Chrome and Python execution.
Credentials
The skill requests no environment variables or credentials, yet advertises writing to Feishu. Writing to Feishu would normally require API credentials (app_id/app_secret or access token) or a feishu CLI binary; none are declared. SKILL.md references Feishu API actions but the package neither asks for nor provides a means to supply Feishu credentials or a Feishu client. This is a clear mismatch between claimed capabilities and required permissions/configuration.
Persistence & Privilege
The skill does not request 'always: true' and does not modify other skills or global agent settings. It writes downloaded images to a local cache directory (default /tmp/wechat_article_<timestamp>) and otherwise prints outputs; this is normal for a scraping tool and does not escalate privileges.
What to consider before installing
This skill reliably implements scraping and local image download using headless Chrome, but its advertised Feishu import feature is not implemented in the provided code and no Feishu credentials or CLI are requested. Before installing or running:

- Expect the tool to only scrape and produce local Markdown + images unless you separately provide a Feishu integration (a CLI or API glue that supplies credentials). The current package does not contain Feishu API calls.
- Do not assume it will post data to Feishu automatically; verify how you or your environment will supply Feishu tokens if you want automated import. If there is an external feishu_create_doc tool you plan to use, ensure it is trustworthy.
- Be aware it uses headless Chrome with anti-detection flags to bypass WeChat protection; scraping mp.weixin.qq.com may violate the site's terms of service or local laws, so consider the legal/ethical implications.
- Run the scripts in an isolated environment (e.g., a disposable VM or container) until you confirm behavior. Check the cache directory (/tmp or the supplied cache-dir) for downloaded images and remove sensitive files afterwards.
- If you need automated Feishu import, ask the author to either (a) include explicit Feishu integration code and declare the required env vars (tokens) or (b) document the exact external CLI/tools required. Without that, the skill is incomplete and its description is misleading.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Bins: python3, google-chrome
latest: vk9724mfvsx7cb707edcat719jn83veke
113 downloads
0 stars
1 version
Updated 1mo ago
v2.0.0
MIT-0

WeChat Article Scraper v2.0

WeChat Official Accounts article scraping + Feishu import tool.

Workflow

1. Headless Chrome fetches the full HTML (bypasses WeChat anti-scraping detection)
2. Parse the HTML and extract ordered content blocks (text / images / GIF videos)
3. Download all article images locally
4. Create a Feishu knowledge-base document
5. Write the text content
6. Insert all images (append-to-end only; this is a Feishu API limitation)
7. The user manually drags each image to its correct position in the Feishu editor
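Taken together, steps 1–6 form a simple data pipeline. A minimal sketch, where every function parameter is a hypothetical placeholder injected by the caller rather than the skill's actual API:

```python
def run_pipeline(url, fetch_html, parse_blocks, download_images, import_doc):
    """Chain the scrape/parse/download/import steps described above."""
    html = fetch_html(url)             # step 1: headless-Chrome fetch
    blocks = parse_blocks(html)        # step 2: ordered content blocks
    images = download_images(blocks)   # step 3: cache images locally
    return import_doc(blocks, images)  # steps 4-6: Feishu create/write/append
```

Step 7, repositioning the images, remains a manual action in the Feishu editor.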

Core scripts

scrape.py — scrape only (does not write to Feishu)

python3 ~/.openclaw/skills/wechat-article-scraper/scripts/scrape.py <article_url>

Outputs JSON: title / author / content / word_count / url
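If you drive scrape.py from another Python process, the JSON contract above can be wrapped as follows. This is a sketch: the field list comes from this README, while the command builder and validation helper are my own additions.

```python
import json
import subprocess
from pathlib import Path

SCRAPE = Path.home() / ".openclaw/skills/wechat-article-scraper/scripts/scrape.py"
FIELDS = {"title", "author", "content", "word_count", "url"}

def build_scrape_cmd(url: str) -> list:
    # Invoke the bundled scraper exactly as documented above.
    return ["python3", str(SCRAPE), url]

def parse_scrape_output(raw: str) -> dict:
    # Validate that the scraper returned every documented field.
    data = json.loads(raw)
    missing = FIELDS - data.keys()
    if missing:
        raise ValueError("scrape.py output missing fields: %s" % sorted(missing))
    return data

def scrape(url: str) -> dict:
    out = subprocess.run(build_scrape_cmd(url), capture_output=True,
                         text=True, check=True).stdout
    return parse_scrape_output(out)
```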

scrape_and_import.py — full pipeline

python3 ~/.openclaw/skills/wechat-article-scraper/scripts/scrape_and_import.py <article_url> [--cache-dir /path] [--dry-run]
  • --dry-run: parse the content only, do not write to Feishu
  • Automatically downloads images to the cache directory
  • Prints the content-block structure and a summary

Content-block parsing logic

  1. Locate the id="js_content" region
  2. Extract every data-src image with a regex (preserving article order)
  3. Clean the HTML fragment between each pair of images into plain text
  4. Recognize GIF images as video previews (not downloaded)
  5. Skip the trailing clutter (tipping / comments / recommended reading)
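The five steps above could be sketched roughly as follows. This is a simplified re-implementation for illustration only; the bundled parser handles more edge cases, including the trailing-clutter removal of step 5.

```python
import re
from html import unescape

def split_blocks(html: str) -> list:
    """Split the js_content region into ordered text/image/video blocks."""
    m = re.search(r'<[^>]*id="js_content"[^>]*>(.*)', html, re.S)
    body = m.group(1) if m else html  # step 1: locate js_content
    blocks, pos = [], 0
    # step 2: walk every data-src image in document order
    for img in re.finditer(r'<img[^>]+data-src="([^"]+)"[^>]*>', body):
        # step 3: clean the HTML between images into plain text
        text = unescape(re.sub(r"<[^>]+>", " ", body[pos:img.start()]))
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            blocks.append({"type": "text", "text": text})
        src = img.group(1)
        # step 4: GIFs are video previews, not downloadable images
        kind = "video" if ".gif" in src.lower() else "image"
        blocks.append({"type": kind, "src": src})
        pos = img.end()
    tail = unescape(re.sub(r"<[^>]+>", " ", body[pos:]))
    tail = re.sub(r"\s+", " ", tail).strip()
    if tail:
        blocks.append({"type": "text", "text": tail})
    return blocks
```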

Feishu write flow (must follow this order)

feishu_create_doc          → create a blank document
feishu_update_doc(mode=overwrite)  → write all the text content
feishu_doc_media(insert)  → insert all images one by one (appended at the end)

⚠️ Key limitation

Images can only be appended to the end of the document; they cannot be inserted mid-document.

Reason: the Feishu doc_media/insert API does not support block_id / position parameters, so image blocks can only be appended at the end of the document.

Workaround:

  1. Write all the text first (mark each image's position with a bracketed placeholder such as "[图1:xxx]")
  2. Then insert all the images at the end in one pass
  3. The user opens the Feishu document and manually drags each image next to its text

This is a limitation of the official Feishu API, not of this tool's implementation.
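The workaround can be sketched as a small driver. Note that the three feishu_* commands are referenced in this README but not shipped with the skill, so they appear here only as caller-supplied hypothetical callables:

```python
def import_to_feishu(blocks, image_paths, create_doc, update_doc, insert_media):
    """Text first, then append all images (the Feishu API limitation above)."""
    doc_id = create_doc()  # stand-in for feishu_create_doc
    parts, img_no = [], 0
    for blk in blocks:
        if blk["type"] == "text":
            parts.append(blk["text"])
        elif blk["type"] == "image":
            img_no += 1
            parts.append("[图%d]" % img_no)  # bracketed placeholder (step 1)
    # stand-in for feishu_update_doc(mode=overwrite)
    update_doc(doc_id, "\n\n".join(parts), mode="overwrite")
    # step 2: append every image at the end; mid-document insert is impossible
    for path in image_paths:  # img_001.png ... img_NNN.png, in article order
        insert_media(doc_id, path)
    return doc_id
```

Step 3, dragging the images into place, stays manual.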

Image ordering (article block structure)

The script prints each content block's type and preview text; the blocks map as follows:

Block #  Type   Description
N        text   plain-text paragraph
N        image  article image (img_XXX.png)
N        video  animated GIF = video preview (not downloaded)

When inserting, the filename order img_001.png ~ img_NNN.png corresponds to the order of the images in the article.
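Because the filenames are zero-padded, a plain lexicographic sort of the cache directory reproduces the article order. A sketch, where `cache_dir` is whatever --cache-dir pointed at:

```python
from pathlib import Path

def ordered_images(cache_dir) -> list:
    # img_001.png < img_002.png < ... lexicographically, thanks to zero-padding
    return sorted(p.name for p in Path(cache_dir).glob("img_*.png"))
```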

Environment requirements

  • google-chrome installed at /usr/bin/google-chrome
  • Anti-detection flags that bypass WeChat's UA checks:
    • --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
    • --disable-blink-features=AutomationControlled
    • --virtual-time-budget=20000 (waits for JS to finish rendering)
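One plausible way to combine these flags in a headless fetch is sketched below. The --headless / --dump-dom invocation and the full UA string are my assumptions; the bundled scripts may invoke Chrome differently.

```python
import subprocess

# Example UA string; the skill's actual (elided) UA may differ.
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/120.0 Safari/537.36")

def chrome_cmd(url: str, chrome: str = "/usr/bin/google-chrome") -> list:
    """Build a headless-Chrome command that dumps the rendered DOM to stdout."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--user-agent={UA}",
        "--disable-blink-features=AutomationControlled",  # hide webdriver hints
        "--virtual-time-budget=20000",  # give page JS a ~20 s virtual budget
        "--dump-dom",
        url,
    ]

def fetch_html(url: str) -> str:
    return subprocess.run(chrome_cmd(url), capture_output=True,
                          text=True, check=True).stdout
```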
