Install
openclaw skills install read-wechat-article
Fetch and parse WeChat public articles by extracting clean content, metadata, images, word count, and reading time without browser rendering.
🎯 A production-grade tool for fetching and parsing WeChat Official Account articles, built to Claw Hub publishing standards.
pip install -r requirements.txt
Or install the dependencies directly:
pip install requests beautifulsoup4 markdownify
# Basic usage
python read_wechat_article.py "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw"
# Print detailed logs
python read_wechat_article.py "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw" -v
# Save the result to a file
python read_wechat_article.py "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw" -o output.json
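The commands above take a positional URL plus the `-v` and `-o` flags. A minimal sketch of how such a command line could be parsed with `argparse` follows; the function name `build_parser` and the attribute names `verbose` and `output` are hypothetical, and the tool's actual argument handling may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the URL plus the -v and -o flags shown above."""
    parser = argparse.ArgumentParser(prog="read_wechat_article.py")
    parser.add_argument("url", help="WeChat article URL")
    # -v switches on detailed logging (attribute name is an assumption)
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="print detailed logs")
    # -o names the file that receives the JSON result (attribute name is an assumption)
    parser.add_argument("-o", dest="output", default=None,
                        help="path of the JSON file to write")
    return parser


# Parse the same arguments as the third command above
args = build_parser().parse_args(
    ["https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw", "-v", "-o", "output.json"]
)
print(args.url, args.verbose, args.output)
```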
from read_wechat_article import read_wechat_article

# URL of the Official Account article
url = "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw"

# Fetch and parse the article
result = read_wechat_article(url)

# Print the result
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print(f"Published: {result['publish_time']}")
print(f"Word count: {result['word_count']:,}")
print(f"Reading time: {result['read_time_minutes']} min")
print(f"Image count: {len(result['images'])}")
print(f"Markdown content: {result['content_markdown'][:500]}...")
from claw import skill

# Invoke the skill
result = skill.run(
    "read_wechat_article",
    url="https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw"
)
if result["success"]:
    data = result["data"]
    print(f"Article title: {data['title']}")
else:
    print(f"Processing failed: {result['error']}")
{
  "title": "未来1500天,影视行业的钱会被这1%的人赚走?",
  "author": "郑林",
  "publish_time": "2024-03-18 18:06",
  "content_markdown": "# 未来1500天,影视行业的钱会被这1%的人赚走?\n\n在过去的三年里,影视行业经历了前所未有的挑战...",
  "content_text": "未来1500天,影视行业的钱会被这1%的人赚走?\n\n在过去的三年里,影视行业经历了前所未有的挑战...",
  "images": [
    "https://mmbiz.qpic.cn/mmbiz_jpg/.../640",
    "https://mmbiz.qpic.cn/mmbiz_jpg/.../640"
  ],
  "original_url": "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw",
  "word_count": 25306,
  "read_time_minutes": 51
}
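The `word_count` and `read_time_minutes` values in the sample output are consistent with a reading rate of about 500 characters per minute (25,306 / 500 rounds up to 51). The sketch below shows one way such fields could be computed. The helper names, the 500-per-minute rate, and the decision to count non-whitespace characters are all assumptions, not the tool's confirmed implementation.

```python
import math
import re

CHARS_PER_MINUTE = 500  # assumed reading speed; inferred from the sample, not confirmed


def count_words(text: str) -> int:
    """Count non-whitespace characters, a common proxy for word count in Chinese text."""
    return len(re.sub(r"\s+", "", text))


def estimate_read_time(word_count: int, rate: int = CHARS_PER_MINUTE) -> int:
    """Return the reading time in whole minutes, rounded up, with a one-minute floor."""
    return max(1, math.ceil(word_count / rate))


print(estimate_read_time(25306))
```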
import re

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://mp.weixin.qq.com/",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7"
}
TIMEOUT = 15      # request timeout (seconds)
RETRY_TIMES = 3   # maximum number of retries
RETRY_DELAY = 2   # delay between retries (seconds)
PUB_TIME_PATTERN = re.compile(r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2})')  # matches the publish time
URL_CLEAN_PATTERN = re.compile(r'https://mp\.weixin\.qq\.com/s/[A-Za-z0-9_-]+')  # strips extra URL parameters
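The `RETRY_TIMES` and `RETRY_DELAY` settings suggest a simple retry loop around the HTTP request. A minimal sketch of such a loop is shown here, assuming `RETRY_TIMES` is the total number of attempts and that any exception triggers a retry. The helper `with_retries` and its `sleep` parameter are hypothetical and exist only to keep the example self-contained and testable.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

RETRY_TIMES = 3  # treated here as the total number of attempts (an assumption)
RETRY_DELAY = 2  # seconds to wait between attempts


def with_retries(
    fetch: Callable[[], T],
    retries: int = RETRY_TIMES,
    delay: float = RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts are used up."""
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:  # a real tool would likely catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"failed after {retries} attempts") from last_error
```

With the `requests` library, the fetch step could be wrapped as `with_retries(lambda: requests.get(url, headers=HEADERS, timeout=TIMEOUT))`.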
Extra query parameters are stripped from the URL automatically, keeping only the core part:
Original URL: https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw?from=groupmessage&isappinstalled=0
Cleaned URL: https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw
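The cleaning step can be sketched with the `URL_CLEAN_PATTERN` from the configuration. The function name `clean_url` and the choice to raise `ValueError` on unrecognized input are assumptions; the tool may handle non-matching URLs differently.

```python
import re

# Same pattern as in the configuration above
URL_CLEAN_PATTERN = re.compile(r'https://mp\.weixin\.qq\.com/s/[A-Za-z0-9_-]+')


def clean_url(url: str) -> str:
    """Return the article URL without query parameters or fragments."""
    match = URL_CLEAN_PATTERN.search(url)
    if match is None:
        # Assumed behavior: reject anything that is not a short-form article link
        raise ValueError(f"not a recognized WeChat article URL: {url}")
    return match.group(0)


print(clean_url(
    "https://mp.weixin.qq.com/s/ijZyuHyubiX7Dp1tJrxZOw?from=groupmessage&isappinstalled=0"
))
```

Because the pattern only matches the `/s/<id>` form, links in other shapes would be rejected by this sketch rather than cleaned.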
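HTTP request → HTML response → title extraction → author extraction → publish-time extraction → body extraction → content cleaning → format conversion → result output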
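One stage of this flow, publish-time extraction, can be sketched with the `PUB_TIME_PATTERN` from the configuration. The function name `extract_publish_time` and the choice to search the raw HTML string are assumptions; the tool may instead look at a specific element of the page.

```python
import re
from typing import Optional

# Same pattern as in the configuration above
PUB_TIME_PATTERN = re.compile(r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2})')


def extract_publish_time(html: str) -> Optional[str]:
    """Return the first 'YYYY-MM-DD HH:MM' timestamp in the page, or None if absent."""
    match = PUB_TIME_PATTERN.search(html)
    return match.group(1) if match else None


sample = "<span>2024-03-18 18:06</span>"  # made-up fragment for illustration
print(extract_publish_time(sample))
```

Returning `None` instead of raising lets the remaining stages run even when a page carries no recognizable timestamp.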
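This tool is intended for learning and research purposes only. Do not use it for illegal purposes. Users bear sole legal responsibility for any consequences of using this tool.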
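from typing import Dict

import requests


def download_image(url: str, save_path: str) -> None:
    """Download an image to a local file."""
    # HEADERS and TIMEOUT come from the configuration
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()  # do not save an HTTP error page as an image
    with open(save_path, 'wb') as f:
        f.write(response.content)


def summarize_content(text: str, max_length: int = 500) -> str:
    """Return a truncated summary of the content."""
    # An LLM could be plugged in here for smarter summaries
    if len(text) <= max_length:
        return text
    return text[:max_length] + "..."


def save_to_database(result: Dict, db_conn) -> None:
    """Save the parsed article to a database."""
    cursor = db_conn.cursor()
    cursor.execute(
        "INSERT INTO articles (title, author, content_text) VALUES (?, ?, ?)",
        (result['title'], result['author'], result['content_text'])
    )
    db_conn.commit()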
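The `?` placeholders in `save_to_database` match the parameter style of Python's built-in `sqlite3` module, so the function can be tried against an in-memory database. In the sketch below, the table schema (three `TEXT` columns) and the sample record are assumptions made for illustration; only the column names come from the `INSERT` statement.

```python
import sqlite3
from typing import Dict


def save_to_database(result: Dict, db_conn) -> None:
    """Save the parsed article to a database (same logic as the function above)."""
    cursor = db_conn.cursor()
    cursor.execute(
        "INSERT INTO articles (title, author, content_text) VALUES (?, ?, ?)",
        (result["title"], result["author"], result["content_text"]),
    )
    db_conn.commit()


# In-memory database with a hypothetical minimal schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, author TEXT, content_text TEXT)")

# Placeholder record standing in for the output of read_wechat_article()
save_to_database(
    {"title": "Sample title", "author": "Sample author", "content_text": "Sample body"},
    conn,
)

rows = conn.execute("SELECT title, author FROM articles").fetchall()
print(rows)
```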
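A: This article can only be accessed after logging in to a WeChat account. You can try:
A: This may be caused by a network problem or server-side restrictions. You can try:
A: This may be caused by a change in the WeChat page structure. You can try:
A: This may be caused by network latency or slow page loading. You can try: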
MIT License - see the LICENSE file for details