mp-weixin

v1.0.0

利用 Python 从微信公众号文章中提取元数据与正文内容。适用于用户需要解析微信文章链接(mp.weixin.qq.com)、提取文章信息(标题、作者、正文、发布时间、封面图),或将微信文章转换为结构化数据的情景。

0· 151·0 current·0 all-time

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for bozoyan/mp-weixin.

Previewing Install & Setup.
Prompt PreviewInstall & Setup
Install the skill "mp-weixin" (bozoyan/mp-weixin) from ClawHub.
Skill page: https://clawhub.ai/bozoyan/mp-weixin
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install mp-weixin

ClawHub CLI

Package manager switcher

npx clawhub@latest install mp-weixin
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description (extracting metadata/content from mp.weixin.qq.com) matches the included Python scripts and declared dependencies (requests, BeautifulSoup, lxml). Required binary is python3 which is appropriate.
Instruction Scope
SKILL.md and the scripts only instruct the agent to fetch the provided WeChat article URL, parse HTML, and return structured fields. There are no instructions to read unrelated files, environment variables, or send data to external endpoints other than the target article URLs or user-configured proxies.
Install Mechanism
There is no automated installer; SKILL.md recommends installing common pip packages (beautifulsoup4, requests, lxml) from PyPI or a mirror. All code is bundled with the skill (no remote downloads or extract steps).
Credentials
The skill declares no required environment variables, credentials, or config paths. The code does not reference secrets or unrelated env vars; network access is limited to requests made to URLs supplied by the user.
Persistence & Privilege
always is false and the skill is user-invocable. It does not attempt to modify other skills or system-wide config; it only contains self-contained scripts.
Assessment
This skill appears to do exactly what it claims: fetch and parse WeChat public-article pages. Before running: (1) review the bundled Python files yourself (they are small and readable), (2) install pip packages from a trusted index, (3) avoid feeding it URLs you don't trust or content behind login/paywalls, and (4) respect WeChat's terms of service and rate-limit requests (the README already recommends sleep/proxy options). If you plan to run this on sensitive systems, execute it in a sandbox or isolated environment since it will make outbound HTTP requests to URLs you provide.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

📰 Clawdis
Binspython3
latestvk97d8y5qkj9gpymzzkrtp4dben8352cv
151downloads
0stars
1versions
Updated 1mo ago
v1.0.0
MIT-0

微信文章提取器 - Python 版本

使用 Python 提取微信公众号文章的标题、作者、内容、发布时间等元数据。

✅ 核心优势

相比 JavaScript 版本,Python 版本有以下优势:

  • 无需 npm 依赖:只用 Python 标准库 + 常用 pip 包
  • 安装快速pip install beautifulsoup4 requests lxml 即可完成
  • 绕过验证码:使用微信 User-Agent,提高访问成功率
  • 轻量级:脚本仅 10KB,无复杂依赖
  • 易维护:Python 代码更易读易改

📦 依赖安装

pip install beautifulsoup4 requests lxml

或使用国内镜像加速:

pip install beautifulsoup4 requests lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

🚀 使用方法

基本用法

python3 scripts/wechat_extractor.py <微信文章 URL>

示例

# 提取单篇文章
python3 scripts/wechat_extractor.py "https://mp.weixin.qq.com/s/xN1H5s66ruXY9s8aOd4Rcg"

# 在 Python 脚本中调用
from scripts.wechat_extractor import WeChatArticleExtractor

extractor = WeChatArticleExtractor()
result = extractor.extract('https://mp.weixin.qq.com/s/xxx')

if result['done']:
    print('标题:', result['data']['msg_title'])
    print('作者:', result['data']['msg_author'])
    print('内容:', result['data']['msg_content'][:500])

📊 输出数据说明

成功响应

{
  "done": true,
  "code": 0,
  "data": {
    // 文章信息
    "msg_title": "文章标题",
    "msg_desc": "文章摘要",
    "msg_content": "<div>HTML 内容</div>",
    "msg_cover": "封面图 URL",
    "msg_author": "作者",
    "msg_type": "post",
    "msg_publish_time_str": "2026/03/14 10:30:00",
    "msg_link": "文章链接",
    
    // URL 参数
    "msg_mid": "mid 参数",
    "msg_idx": "idx 参数",
    "msg_sn": "sn 参数",
    "msg_biz": "__biz 参数",
    
    // 公众号信息
    "account_name": "公众号名称",
    "account_alias": "微信号",
    "account_id": "原始 ID",
    "account_description": "功能介绍",
    "account_avatar": "头像 URL",
    
    // 版权信息
    "msg_has_copyright": true
  }
}

错误响应

{
  "done": false,
  "code": 1002,
  "msg": "请求超时"
}

⚠️ 错误码说明

错误码说明解决方案
1001无法获取文章信息检查 URL 是否正确
1002请求失败/超时检查网络连接,稍后重试
2006需要验证码微信反爬机制,稍后重试或使用已登录会话
2008系统出错查看错误详情,联系开发者

🎯 使用场景

适用场景

  • 提取微信公众号文章内容
  • 获取文章元数据(标题、作者、发布时间)
  • 批量采集微信文章
  • 监控特定公众号更新
  • 微信文章归档

不适用场景

  • 需要访问需要登录的付费文章
  • 需要绕过微信验证码的批量采集
  • 需要提取评论区内容

💡 最佳实践

1. 批量提取

urls = [
    'https://mp.weixin.qq.com/s/xxx1',
    'https://mp.weixin.qq.com/s/xxx2',
    'https://mp.weixin.qq.com/s/xxx3',
]

extractor = WeChatArticleExtractor(timeout=30)

for url in urls:
    result = extractor.extract(url)
    if result['done']:
        print(f"✅ {result['data']['msg_title']}")
    else:
        print(f"❌ {url}: {result['msg']}")
    
    # 避免请求过快
    import time
    time.sleep(1)

2. 保存为 JSON

import json

result = extractor.extract(url)

with open('article.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, ensure_ascii=False, indent=2)

3. 提取纯文本内容

from bs4 import BeautifulSoup

result = extractor.extract(url)
html_content = result['data']['msg_content']

# 转为纯文本
soup = BeautifulSoup(html_content, 'lxml')
text_content = soup.get_text(separator='\n', strip=True)
print(text_content)

🔧 高级配置

自定义请求头

extractor = WeChatArticleExtractor(timeout=60)
extractor.session.headers.update({
    'User-Agent': '自定义 User-Agent'
})

使用代理

extractor = WeChatArticleExtractor(timeout=30)
extractor.session.proxies.update({
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890'
})

📝 注意事项

  1. 验证码问题:微信有反爬机制,频繁访问可能触发验证码,建议:

    • 控制请求频率(每秒不超过 1 次)
    • 使用固定的 User-Agent
    • 必要时使用已登录的 Cookie
  2. 内容完整性:部分文章可能包含视频、音频等多媒体内容,HTML 内容中会保留引用链接

  3. 发布时间:微信文章的发布时间可能无法精确提取,部分文章只能获取大致日期

  4. 公众号信息:公众号详细信息(微信号、原始 ID 等)需要从页面 JavaScript 中提取,可能不完整

🆚 与 JavaScript 版本对比

特性Python 版本JavaScript 版本
依赖pip (bs4, requests)npm (cheerio, request-promise)
安装速度快(<30 秒)慢(可能超时)
代码量~300 行~600 行
可维护性
验证码处理使用微信 UA 绕过需要额外配置
推荐度⭐⭐⭐⭐⭐⭐⭐⭐

📚 示例输出

正在提取文章:https://mp.weixin.qq.com/s/xN1H5s66ruXY9s8aOd4Rcg
---
✅ 提取成功!

📰 文章标题:4B 参数实现理解、推理、生成、编辑一体化!InternVL-U 重磅开源
👤 作者:书生 Intern
📢 公众号:书生 Intern
⏰ 发布时间:2026/03/14 10:30:00
📝 文章摘要:重新定义统一多模态模型的 "效率 - 性能" 边界。
🖼️ 封面图:https://mmbiz.qpic.cn/mmbiz_jpg/...
📄 文章类型:post
🔗 文章链接:https://mp.weixin.qq.com/s/xN1H5s66ruXY9s8aOd4Rcg

📊 公众号信息:
  - 名称:书生 Intern
  - 微信号:未设置
  - 原始 ID: 未设置
  - 功能介绍:未设置

📝 文章内容长度:129134 字符

💾 详细数据已保存到:/tmp/wechat_article.json

🔗 相关链接


Python 版本 by bozoyan · 2026-03-14 · 专为 CoPaw 优化 📰✨

Comments

Loading comments...