Wechat Article To Markdown

v1.1.0

This skill converts WeChat Official Account (微信公众号) article pages into high-quality, clean Markdown format. It should be used when the user provides a WeChat...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for benzking/wechat-article-to-markdown-v2.

Install the skill "Wechat Article To Markdown" (benzking/wechat-article-to-markdown-v2) from ClawHub.
Skill page: https://clawhub.ai/benzking/wechat-article-to-markdown-v2
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install wechat-article-to-markdown-v2

ClawHub CLI


npx clawhub@latest install wechat-article-to-markdown-v2

Security Scan

VirusTotal: Benign
OpenClaw: Benign (medium confidence)

Purpose & Capability
The skill's name/description match the included Python script and reference docs: it fetches mp.weixin.qq.com pages, strips WeChat-specific noise, detects code blocks and images, and outputs Markdown. The code's network activity (page fetch + image downloads from CDN domains like mmbiz.qpic.cn) is consistent with the stated purpose.
Instruction Scope
SKILL.md describes running the included script and using convert_simple() programmatically; instructions do not ask the agent to read unrelated system files or solicit unrelated secrets. Note: SKILL.md's Step 1 omits installing Playwright and the necessary 'playwright install chromium' step even though the script's primary fetcher uses Playwright, which will cause runtime errors or surprising fallbacks to requests if Playwright is not present.
Install Mechanism
This is an instruction-only skill (no packaged installer). The code file uses standard Python libraries and Playwright; no external arbitrary downloads from unknown hosts are embedded in the skill. Because Playwright/Chromium may need to be installed manually, running the script can cause a Chromium download via Playwright tooling (this is expected for JS-rendered fetches).
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to read secrets; network access is limited to fetching the article URL and its assets (images). No unrelated service tokens or config paths are requested.
Persistence & Privilege
always is false and the skill does not declare autonomous system-wide persistence. One implementation detail: Playwright is launched with launch_persistent_context(user_data_dir='') — using a persistent context may create a browser profile directory (depending on Playwright behavior) and could persist cookies/local storage between runs; this is plausible for convenience but worth noting.
Assessment
This skill appears to do what it says: fetch a WeChat article, clean it, convert it to Markdown, and optionally download images. Before running:

  1. Install Playwright and run 'playwright install chromium' (SKILL.md omits this), or run with a mode that uses requests only; otherwise the script will fall back or error.
  2. Expect network activity to the article host and image CDNs (e.g., mp.weixin.qq.com and mmbiz.qpic.cn) and disk writes for output and images. Run inside an isolated environment if you want to limit exposure.
  3. If you are concerned about persistent browser state, inspect or override the Playwright user_data_dir usage to avoid creating persistent profiles.
  4. If you want higher assurance, review the remainder of the script (the full file is large) or run it in a sandbox/VM.

There are no red flags for credential exfiltration or unknown external endpoints in the provided files.


latest vk975sxz9czw9dpbyz99h0m10jh84mtts
108 downloads
0 stars
1 version
Updated 2w ago
v1.1.0
License: MIT-0

Changelog

v1.1.0 (2026-04-11)

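Fixes:

  • fetch_with_playwright now uses mobile Chromium (is_mobile=True + iPhone UA + 393×852 viewport), so temporary share links (tempkey) render correctly
  • Added lazy-loaded image handling: scrolling triggers loading of data-src images
  • Added detection of the "page does not exist" error

Comparison (v1.0 → v1.1):

Item                 Old version        New version
User Agent           Desktop Chrome     iPhone Safari
Viewport             1280×900           393×852
Temporary links      ❌ Cannot render    ✅ Works
Lazy-loaded images   (none)             ✅ Scroll-triggered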

WeChat Article to Markdown

Overview

Convert WeChat Official Account articles (mp.weixin.qq.com) into clean, high-quality Markdown. The skill uses a Python script optimized for WeChat's unique DOM structure, featuring deep noise removal, smart code block detection, rich text preservation, and intelligent paragraph formatting.

Workflow

Decision Tree

User provides WeChat article URL?
├── Yes → Go to Step 1: Install Dependencies & Convert
├── User wants to convert HTML directly?
│   └── Use Step 2: In-Line Conversion (for Pre-Fetched HTML)
└── User asks about multiple URLs?
    └── Use batch mode with -f flag

Step 1: Install Dependencies & Convert

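  1. Ensure Python dependencies are available. Install if missing:

    pip install requests beautifulsoup4 markdownify

    The script's primary fetcher uses Playwright, so also install it and its Chromium build:

    pip install playwright
    playwright install chromium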
    
  2. Run the conversion script:

    python scripts/wechat_to_md.py "<WECHAT_URL>" -o "<OUTPUT_DIR>"
    

    Options:

    • --no-images — Skip image downloading, keep remote URLs
    • --no-frontmatter — Omit YAML frontmatter
    • Multiple URLs: python scripts/wechat_to_md.py url1 url2 url3
  3. The output structure:

    <OUTPUT_DIR>/
    └── <Article_Title>/
        ├── <Article_Title>.md
        └── images/
            ├── img_000.png
            └── img_001.jpg
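
The layout above can be walked with a few lines of Python. This is a minimal sketch, assuming only the directory structure shown (one folder per article, one .md file, an optional images/ folder); the helper name list_converted_articles is hypothetical and is not part of the skill's script.

```python
from pathlib import Path


def list_converted_articles(output_dir):
    """Map each article folder name to its Markdown file and image files.

    Hypothetical helper: it only assumes the <OUTPUT_DIR>/<Article_Title>/ layout.
    """
    results = {}
    for article_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        md_files = sorted(article_dir.glob("*.md"))
        images_dir = article_dir / "images"
        images = sorted(images_dir.iterdir()) if images_dir.is_dir() else []
        results[article_dir.name] = {
            "markdown": md_files[0] if md_files else None,  # None if conversion left no .md
            "images": images,
        }
    return results
```

Calling list_converted_articles("<OUTPUT_DIR>") after a run gives a quick inventory to report back to the user.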
    

Step 2: In-Line Conversion (for Pre-Fetched HTML)

If the HTML has already been fetched (e.g., via web_fetch), use the script's convert_simple() function programmatically:

import sys
sys.path.insert(0, "<SKILL_DIR>/scripts")
from wechat_to_md import convert_simple

# Basic usage: convert only, without downloading images
result = convert_simple("https://mp.weixin.qq.com/s/xxxxx")
markdown = result["markdown"]       # Full Markdown string
metadata = result["metadata"]       # {title, author, date, url, ...}
code_blocks = result["code_blocks"] # [{lang, code}, ...]
image_urls = result["image_urls"]   # List of original image URLs

# Advanced usage: also download images locally
result = convert_simple(
    "https://mp.weixin.qq.com/s/xxxxx",
    download_imgs=True,              # Enable image downloading
    output_dir="./my_article"        # Output directory (optional)
)
markdown = result["markdown"]        # Image links are replaced with local paths
image_mapping = result["image_mapping"]  # URL -> local path mapping
output_dir = result["output_dir"]    # Actual output directory

Return the Markdown content directly to the user or write it to a file.
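
For the "write it to a file" case, a small sketch might look like the following. It assumes only that the result is a dict with a "markdown" key, as in the example above; save_markdown is a hypothetical helper name, not a function exported by the script.

```python
from pathlib import Path


def save_markdown(result, target_path):
    """Write result["markdown"] to target_path as UTF-8 and return the path.

    Hypothetical helper; `result` is assumed to be the dict from convert_simple().
    """
    path = Path(target_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    path.write_text(result["markdown"], encoding="utf-8")
    return path
```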

Step 3: Present Results

  • Display the generated Markdown file path to the user.
  • If the user wants to review the content, read the .md file and present a summary.
  • For batch conversions, report success/failure count.
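
The batch report could be assembled along these lines. This is an illustrative sketch only: the (url, error) pair format and the summarize_batch name are assumptions, since the script's own batch output format is not documented here.

```python
def summarize_batch(outcomes):
    """Build a short success/failure report.

    `outcomes` is an assumed format: a list of (url, error_or_None) pairs.
    """
    failures = [(url, err) for url, err in outcomes if err is not None]
    succeeded = len(outcomes) - len(failures)
    lines = [f"Converted {succeeded} of {len(outcomes)} article(s); {len(failures)} failed."]
    lines += [f"  FAILED {url}: {err}" for url, err in failures]
    return "\n".join(lines)
```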

Core Capabilities

1. Deep Noise Removal (WeChat-Specific)

The script removes 30+ WeChat-specific noise elements including:

  • Ad banners and promotional content (.mp_profile_iframe, #ad_content)
  • QR codes and reward/tip areas (.reward_area, .qr_code_pc)
  • Comment sections (#comment_container, #js_cmt_area)
  • Audio/video players (mpvoice, mpvideo)
  • Related article recommendations (#relation_article)
  • Tool bars, footers, copyright areas, tag sections
  • Hidden elements (display:none, visibility:hidden)
  • Empty <span> placeholders
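
As an illustration of the hidden-element rule in the list above, the check on an inline style attribute can be sketched with a regular expression. This is not the script's actual implementation; is_hidden is a hypothetical name, and real pages may also hide elements through CSS classes, which this sketch does not cover.

```python
import re

# Matches "display: none" or "visibility: hidden" as a declaration in an inline style.
_HIDDEN_STYLE = re.compile(
    r"(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)\b",
    re.IGNORECASE,
)


def is_hidden(style):
    """Return True if an inline style string hides the element (hypothetical helper)."""
    return bool(style) and _HIDDEN_STYLE.search(style) is not None
```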

2. Smart Code Block Detection

Handles all 3 WeChat code block formats:

  • pre.code-snippet with data-lang attribute
  • .code-snippet__fix container with nested pre[data-lang]
  • Generic pre[data-lang]

Features:

  • Auto-detects programming language from data-lang, CSS class, and code content
  • Removes line numbers (.code-snippet__line-index)
  • Filters CSS counter leaks (counter(line) garbage text)
  • Uses placeholder strategy: extract code blocks before conversion, restore after
  • Supports 25+ languages: Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, SQL, HTML, CSS, JSON, YAML, Shell, Dockerfile, etc.
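
The placeholder strategy mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: it only handles the generic pre[data-lang] case with a regular expression, and the placeholder token format and function names are invented for the example rather than taken from the script.

```python
import html
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block
PRE_RE = re.compile(
    r'<pre[^>]*\bdata-lang="([^"]*)"[^>]*>(.*?)</pre>',
    re.DOTALL | re.IGNORECASE,
)


def extract_code_blocks(page_html):
    """Swap each <pre data-lang=...> for a placeholder token; return (html, blocks)."""
    blocks = []

    def stash(match):
        lang = match.group(1)
        code = re.sub(r"<[^>]+>", "", match.group(2))  # drop inner tags such as <code>
        blocks.append((lang, html.unescape(code).strip("\n")))
        return f"CODEBLOCKPLACEHOLDER{len(blocks) - 1}END"

    return PRE_RE.sub(stash, page_html), blocks


def restore_code_blocks(markdown, blocks):
    """Replace each placeholder with a fenced Markdown code block."""
    for index, (lang, code) in enumerate(blocks):
        fenced = f"\n{FENCE}{lang}\n{code}\n{FENCE}\n"
        markdown = markdown.replace(f"CODEBLOCKPLACEHOLDER{index}END", fenced)
    return markdown
```

Extracting first keeps the HTML-to-Markdown converter from reflowing or escaping code, which is the point of the placeholder approach.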

3. Rich Text Preservation

  • Bold/Italic: Normalizes <b> → <strong>, <i> → <em>, handles inline font-weight: bold
  • Lists: Converts WeChat marker-based lists (such as ·, 1., (1)) to proper Markdown lists
  • Blockquotes: Detects left-border styled sections as blockquotes
  • Tables: Preserves table structure
  • Links: Preserves article links
  • Headings: Detects font-size based headings (≥22px → H2, ≥19px → H3)
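
The font-size thresholds for headings can be expressed as a small function. This is a sketch assuming the size is given in px in an inline style attribute; the function names are hypothetical, and the script may read sizes differently.

```python
import re

_FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)


def heading_level(style):
    """Return 2 for >=22px, 3 for >=19px, otherwise None (thresholds from the list above)."""
    match = _FONT_SIZE.search(style or "")
    if not match:
        return None
    size = float(match.group(1))
    if size >= 22:
        return 2
    if size >= 19:
        return 3
    return None


def to_markdown_heading(text, style):
    """Prefix text with '#' marks when its inline style looks like a heading."""
    level = heading_level(style)
    return f"{'#' * level} {text}" if level else text
```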

4. Intelligent Paragraph Formatting

  • Fixes lazy-loaded images (data-src → src)
  • Cleans HTML entity residuals (&nbsp; → space, zero-width spaces removed)
  • Collapses excessive blank lines (max 2 consecutive)
  • Trims trailing whitespace per line
  • Proper spacing around code blocks
  • Full-width spaces → half-width spaces
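
Several of the text clean-ups above can be sketched in one pass over the Markdown string. This is an illustrative version, not the script's code: tidy_markdown is a hypothetical name, and "max 2 consecutive blank lines" is interpreted here as at most three newline characters in a row.

```python
import re


def tidy_markdown(text):
    """Apply the whitespace clean-ups listed above (illustrative sketch)."""
    text = text.replace("&nbsp;", " ").replace("\u00a0", " ")  # entity residue and NBSP
    text = text.replace("\u3000", " ")                         # full-width space
    text = re.sub("[\u200b\u200c\u200d\ufeff]", "", text)      # zero-width characters
    text = "\n".join(line.rstrip() for line in text.split("\n"))  # trailing whitespace
    return re.sub(r"\n{4,}", "\n\n\n", text)                   # keep at most 2 blank lines
```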

5. Metadata Extraction

Generates YAML frontmatter:

---
title: "Article Title"
author: "Account Name"
date: "2026-04-08"
source: "https://mp.weixin.qq.com/s/xxxxx"
description: "Article description if available"
---
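
A frontmatter block of this shape could be produced as sketched below. The field order follows the example above; build_frontmatter is a hypothetical name, and using json.dumps for quoting is a design choice of this sketch (JSON string escapes are valid inside YAML double-quoted scalars), not necessarily what the script does.

```python
import json

FIELDS = ("title", "author", "date", "source", "description")


def build_frontmatter(metadata):
    """Render selected metadata keys as a YAML frontmatter block (illustrative sketch)."""
    lines = ["---"]
    for key in FIELDS:
        value = metadata.get(key)
        if value:  # skip missing or empty fields
            lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Quoting every value avoids surprises when a title contains a colon or a quotation mark.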

6. Image Handling

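  • Automatic download: downloads all article images into the images/ subdirectory
  • Concurrent download: 5 concurrent threads by default, with retry support (2 retries by default)
  • Format detection: detects the image format from the URL and Content-Type
  • Link replacement: replaces remote URLs in the Markdown with local relative paths (images/img_000.png)
  • URL variant handling: handles variants of WeChat image URLs that differ in query parameters
  • Failure fallback: keeps the original remote URL when a download fails
  • File validation: checks the downloaded file size (discards corrupted files smaller than 100 bytes)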
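
Format detection from the URL and Content-Type might look roughly like this. The sketch assumes WeChat image URLs carry a wx_fmt query parameter (an assumption about the CDN's URL shape, not something this document states); guess_extension and its default value are likewise hypothetical.

```python
from urllib.parse import parse_qs, urlparse

_CONTENT_TYPES = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
    "image/svg+xml": "svg",
}
_KNOWN = {"jpg", "jpeg", "png", "gif", "webp", "svg"}


def guess_extension(url, content_type=None, default="jpg"):
    """Pick a file extension from the URL first, then from the Content-Type header."""
    parsed = urlparse(url)
    fmt = parse_qs(parsed.query).get("wx_fmt", [""])[0].lower()  # assumed WeChat parameter
    if not fmt and "." in parsed.path.rsplit("/", 1)[-1]:
        fmt = parsed.path.rsplit(".", 1)[-1].lower()             # plain file suffix
    if fmt in _KNOWN:
        return "jpg" if fmt == "jpeg" else fmt
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in _CONTENT_TYPES:
            return _CONTENT_TYPES[mime]
    return default
```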

Enhanced image download features:

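# Download images and get the URL-to-path mapping
from pathlib import Path

from wechat_to_md import download_images, replace_image_urls

# Download the images
url_to_local = download_images(
    img_urls=["https://mmbiz.qpic.cn/..."],
    output_dir=Path("./output"),
    concurrency=5,    # number of concurrent downloads
    timeout=30,       # timeout in seconds
    retries=2         # number of retries
)

# Replace image links in the Markdown (markdown is the string returned by convert_simple)
md = replace_image_urls(markdown, url_to_local)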
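
To show how the concurrency, retries, and minimum-size settings could interact, here is a network-free sketch. It is not the body of download_images: fetch_all is a hypothetical name, the fetch callable is injected by the caller, and treating an undersized payload as a failed attempt is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, concurrency=5, retries=2, min_bytes=100):
    """Return {url: bytes} for successful downloads; failed URLs are left out.

    `fetch` is any callable taking a URL and returning bytes (supplied by the caller).
    """
    def attempt(url):
        for _ in range(retries + 1):       # first try plus `retries` retries
            try:
                data = fetch(url)
            except Exception:
                continue                   # network-style error: try again
            if len(data) >= min_bytes:     # assumption: tiny payloads count as failures
                return data
        return None

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, urls))
    return {url: data for url, data in zip(urls, results) if data is not None}
```

URLs missing from the returned dict are the ones whose remote links would be kept in the Markdown under the failure-fallback rule.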

Error Handling

Error                  Cause                         Resolution
NetworkError           HTTP failure, timeout, 404    Retries 3x with exponential backoff
CaptchaError           Captcha page detected         Inform user to wait and retry
ParseError             Content element not found     Check URL validity; the article may be restricted
Missing dependencies   pip install not run           Install: pip install requests beautifulsoup4 markdownify
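
The "retries 3x with exponential backoff" behavior can be illustrated with a generic wrapper. This is a sketch under assumptions: the base delay of one second, the doubling factor, and the retry_with_backoff name are chosen for the example, and "3x" is read as three retries after the first attempt.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n and try again.

    Re-raises the last exception once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
```

Passing sleep as a parameter keeps the wrapper easy to exercise without real waiting.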

Important Notes

  • Only supports mp.weixin.qq.com domain articles
  • Some code blocks are rendered as images/SVG — their source code cannot be extracted
  • Captcha pages may appear under high-frequency access; wait and retry
  • Public articles only — login-gated articles cannot be fetched
  • Respect original author copyright; for personal study/archiving use only
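
The domain restriction in the first note can be checked before any fetch is attempted. A minimal sketch, with is_wechat_article_url as a hypothetical helper name:

```python
from urllib.parse import urlparse


def is_wechat_article_url(url):
    """Return True only for http(s) URLs whose host is exactly mp.weixin.qq.com."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname == "mp.weixin.qq.com"
```

Comparing the parsed hostname, rather than searching the raw string, avoids accepting look-alike hosts that merely contain the domain.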

References

For detailed WeChat article DOM structure, selectors, and element handling, refer to:

  • references/wechat-dom-reference.md — Complete WeChat DOM structure documentation
