WeChat Article to Markdown

v1.0.0

This skill converts WeChat Official Account (微信公众号) article pages into high-quality, clean Markdown format. It should be used when the user provides a WeChat...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for benzking/wechat-to-markdown-converter.

Install the skill "WeChat Article to Markdown" (benzking/wechat-to-markdown-converter) from ClawHub.
Skill page: https://clawhub.ai/benzking/wechat-to-markdown-converter
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install wechat-to-markdown-converter

ClawHub CLI


npx clawhub@latest install wechat-to-markdown-converter

Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (WeChat article → Markdown) align with the included Python script and reference docs. The script implements DOM-specific cleaning, metadata extraction, code-block handling, and image downloading — all coherent with the stated goal.
Instruction Scope
SKILL.md instructs fetching a WeChat URL or supplying pre-fetched HTML, running the provided script, and optionally using convert_simple() programmatically. The runtime steps reference only article content, images, and generated .md files; there are no instructions to read unrelated system files, environment secrets, or to transmit data to unexpected endpoints.
Install Mechanism
No formal install spec in registry (instruction-only), but SKILL.md and the script require installing Python dependencies (beautifulsoup4, markdownify, requests) and optionally Playwright plus a Chromium download (playwright install chromium). This is expected for a headless-browser-based scraper but implies a large browser binary will be downloaded when using Playwright.
Credentials
The skill declares no required environment variables, credentials, or config paths. The script performs network requests to the provided WeChat URLs and their image CDN (mmbiz.qpic.cn) — appropriate for its purpose and proportionate to its functionality.
Persistence & Privilege
The skill does not request always-on presence and has default invocation behavior. It does not attempt to modify other skills or global agent configuration based on the provided files and instructions.
Assessment
This skill appears to do what it claims: convert WeChat articles to Markdown and optionally download images. Before installing or running it, consider: (1) It may require installing Playwright and downloading Chromium (a sizable binary) if you want JS-rendered pages; (2) The script will fetch whatever URL you give it and download images from WeChat's CDN — ensure you trust the input URLs and are comfortable with network activity; (3) The tool runs arbitrary Python code on your machine. If you have security concerns, run it in an isolated environment (container or VM) and inspect the full script. No credentials are requested and no suspicious external endpoints were found.

Like a lobster shell, security has layers — review code before you run it.

Tags: article, converter, latest, markdown, wechat
131 downloads
0 stars
1 version
Updated 2w ago
MIT-0

WeChat Article to Markdown

Overview

Convert WeChat Official Account articles (mp.weixin.qq.com) into clean, high-quality Markdown. The skill uses a Python script optimized for WeChat's unique DOM structure, featuring deep noise removal, smart code block detection, rich text preservation, and intelligent paragraph formatting.

Workflow

Decision Tree

User provides WeChat article URL?
├── Yes → Go to Step 1: Install Dependencies & Convert
├── User wants to convert HTML directly?
│   └── Use Step 2: In-Line Conversion (for Pre-Fetched HTML)
└── User asks about multiple URLs?
    └── Use batch mode with -f flag

Step 1: Install Dependencies & Convert

  1. Ensure Python dependencies are available. Install if missing:

    pip install requests beautifulsoup4 markdownify
    
  2. Run the conversion script:

    python scripts/wechat_to_md.py "<WECHAT_URL>" -o "<OUTPUT_DIR>"
    

    Options:

    • --no-images — Skip image downloading, keep remote URLs
    • --no-frontmatter — Omit YAML frontmatter
    • Multiple URLs: python scripts/wechat_to_md.py url1 url2 url3
  3. The output structure:

    <OUTPUT_DIR>/
    └── <Article_Title>/
        ├── <Article_Title>.md
        └── images/
            ├── img_000.png
            └── img_001.jpg
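
For scripted use, the command line above can be assembled programmatically. The following is a minimal sketch that assumes only the script path and flags listed in this step; `build_command` and `run_conversion` are hypothetical helper names, not part of the skill, and combining several URLs with `-o` in one call is an assumption.

```python
import subprocess
import sys
from typing import List, Sequence


def build_command(
    urls: Sequence[str],
    output_dir: str,
    download_images: bool = True,
    frontmatter: bool = True,
    script: str = "scripts/wechat_to_md.py",
) -> List[str]:
    """Assemble the argument list for the conversion script (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one URL is required")
    cmd = [sys.executable, script, *urls, "-o", output_dir]
    if not download_images:
        cmd.append("--no-images")       # keep remote image URLs
    if not frontmatter:
        cmd.append("--no-frontmatter")  # omit the YAML frontmatter
    return cmd


def run_conversion(urls: Sequence[str], output_dir: str, **options: bool) -> int:
    """Run the script and return its exit code; script output goes to the terminal."""
    return subprocess.run(build_command(urls, output_dir, **options)).returncode
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.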
    

Step 2: In-Line Conversion (for Pre-Fetched HTML)

If the HTML has already been fetched (e.g., via web_fetch), use the script's convert_simple() function programmatically:

import sys
sys.path.insert(0, "<SKILL_DIR>/scripts")
from wechat_to_md import convert_simple

# Basic usage: convert only, without downloading images
result = convert_simple("https://mp.weixin.qq.com/s/xxxxx")
markdown = result["markdown"]       # Full Markdown string
metadata = result["metadata"]       # {title, author, date, url, ...}
code_blocks = result["code_blocks"] # [{lang, code}, ...]
image_urls = result["image_urls"]   # List of original image URLs

# Advanced usage: also download images to local disk
result = convert_simple(
    "https://mp.weixin.qq.com/s/xxxxx",
    download_imgs=True,              # Enable image downloading
    output_dir="./my_article"        # Output directory (optional)
)
markdown = result["markdown"]        # Image links already replaced with local paths
image_mapping = result["image_mapping"]  # Mapping of URL -> local path
output_dir = result["output_dir"]    # Actual output directory

Return the Markdown content directly to the user or write it to a file.
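
Writing the returned Markdown to a file could look like the sketch below. It assumes only the `markdown` and `metadata` keys shown above; `save_markdown` and its file-name sanitizing rule are hypothetical and may differ from how the script names its own output.

```python
import re
from pathlib import Path
from typing import Any, Dict


def save_markdown(result: Dict[str, Any], out_dir: str = ".") -> Path:
    """Write result["markdown"] to <out_dir>/<title>.md and return the path."""
    title = str(result.get("metadata", {}).get("title") or "untitled")
    # Replace runs of whitespace and characters that are unsafe in file names.
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "untitled"
    path = Path(out_dir) / f"{safe}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result["markdown"], encoding="utf-8")
    return path
```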

Step 3: Present Results

  • Display the generated Markdown file path to the user.
  • If the user wants to review the content, read the .md file and present a summary.
  • For batch conversions, report success/failure count.

Core Capabilities

1. Deep Noise Removal (WeChat-Specific)

The script removes 30+ WeChat-specific noise elements including:

  • Ad banners and promotional content (.mp_profile_iframe, #ad_content)
  • QR codes and reward/tip areas (.reward_area, .qr_code_pc)
  • Comment sections (#comment_container, #js_cmt_area)
  • Audio/video players (mpvoice, mpvideo)
  • Related article recommendations (#relation_article)
  • Tool bars, footers, copyright areas, tag sections
  • Hidden elements (display:none, visibility:hidden)
  • Empty <span> placeholders
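
As an illustration of the hidden-element rule in the list above, the check on an inline `style` attribute might be sketched as follows. This is not the script's actual code; `is_hidden` is a hypothetical name, and only inline styles are considered (styles applied through CSS classes are out of scope here).

```python
import re
from typing import Optional

# Matches a `display: none` or `visibility: hidden` declaration inside an
# inline style string, tolerating whitespace and a trailing `!important`.
_HIDDEN_RE = re.compile(
    r"(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)"
    r"\s*(?:!important\s*)?(?:;|$)",
    re.IGNORECASE,
)


def is_hidden(style: Optional[str]) -> bool:
    """Return True if an inline style string hides its element."""
    return bool(style) and _HIDDEN_RE.search(style) is not None
```

With BeautifulSoup, one option would be to call `tag.decompose()` on every tag whose `style` attribute satisfies `is_hidden`.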

2. Smart Code Block Detection

Handles all 3 WeChat code block formats:

  • pre.code-snippet with data-lang attribute
  • .code-snippet__fix container with nested pre[data-lang]
  • Generic pre[data-lang]

Features:

  • Auto-detects programming language from data-lang, CSS class, and code content
  • Removes line numbers (.code-snippet__line-index)
  • Filters CSS counter leaks (counter(line) garbage text)
  • Uses placeholder strategy: extract code blocks before conversion, restore after
  • Supports 25+ languages: Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, SQL, HTML, CSS, JSON, YAML, Shell, Dockerfile, etc.
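
The placeholder strategy in the list above can be illustrated with a standard-library sketch. It is an assumption-laden simplification, not the script's implementation: it handles only the generic `pre[data-lang]` form with a regular expression, and the names `extract_code_blocks`, `restore_code_blocks`, and the placeholder token are hypothetical.

```python
import html
import re
from typing import List, Tuple

FENCE = "`" * 3  # built at runtime so this example nests inside a fenced block

_PRE_RE = re.compile(
    r'<pre\b[^>]*?\bdata-lang="([^"]*)"[^>]*>(.*?)</pre>',
    re.DOTALL | re.IGNORECASE,
)
_PLACEHOLDER_RE = re.compile(r"CODEBLOCKPLACEHOLDER(\d+)END")


def extract_code_blocks(page: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Swap each <pre data-lang> element for a token; return (html, blocks)."""
    blocks: List[Tuple[str, str]] = []

    def stash(match: re.Match) -> str:
        code = re.sub(r"<br\s*/?>", "\n", match.group(2))  # keep line breaks
        code = re.sub(r"<[^>]+>", "", code)                # drop inner tags
        blocks.append((match.group(1), html.unescape(code).strip("\n")))
        # An alphanumeric token passes through HTML-to-Markdown conversion unchanged.
        return f"CODEBLOCKPLACEHOLDER{len(blocks) - 1}END"

    return _PRE_RE.sub(stash, page), blocks


def restore_code_blocks(markdown: str, blocks: List[Tuple[str, str]]) -> str:
    """Replace each token with a fenced code block carrying its language."""

    def unstash(match: re.Match) -> str:
        lang, code = blocks[int(match.group(1))]
        return f"\n\n{FENCE}{lang}\n{code}\n{FENCE}\n\n"

    return _PLACEHOLDER_RE.sub(unstash, markdown)
```

Extracting the code first keeps the HTML-to-Markdown converter from escaping or reflowing characters inside code, which is the usual motivation for this design.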

3. Rich Text Preservation

  • Bold/Italic: Normalizes <b> → <strong>, <i> → <em>, handles inline font-weight: bold
  • Lists: Converts WeChat marker-based lists (•, ·, 1., (1)) to proper Markdown lists
  • Blockquotes: Detects left-border styled sections as blockquotes
  • Tables: Preserves table structure
  • Links: Preserves article links
  • Headings: Detects font-size based headings (≥22px → H2, ≥19px → H3)
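
The font-size thresholds for headings can be expressed as a small function. This sketch assumes the size is given in `px` in an inline `style` attribute; `heading_level` and `as_markdown_heading` are hypothetical names rather than the script's API.

```python
import re
from typing import Optional

_FONT_SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)


def heading_level(style: Optional[str]) -> Optional[int]:
    """Map an inline font-size to a heading level: >=22px is H2, >=19px is H3."""
    match = _FONT_SIZE_RE.search(style or "")
    if match is None:
        return None
    size = float(match.group(1))
    if size >= 22:
        return 2
    if size >= 19:
        return 3
    return None


def as_markdown_heading(text: str, style: Optional[str]) -> str:
    """Prefix text with the matching number of '#' marks, or return it unchanged."""
    level = heading_level(style)
    return f"{'#' * level} {text.strip()}" if level else text
```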

4. Intelligent Paragraph Formatting

  • Fixes lazy-loaded images (data-src → src)
  • Cleans HTML entity residuals (&nbsp; → space, zero-width spaces removed)
  • Collapses excessive blank lines (max 2 consecutive)
  • Trims trailing whitespace per line
  • Proper spacing around code blocks
  • Full-width spaces → half-width spaces
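
The text clean-up rules above might be combined roughly as follows. This is a sketch, not the script's code: `clean_markdown` is a hypothetical name, and "max 2 consecutive" is read here as at most two blank lines in a row.

```python
import re

# Zero-width characters to delete (zero-width space, non-joiner, joiner, BOM).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def clean_markdown(text: str) -> str:
    """Apply entity, whitespace, and blank-line clean-up to Markdown text."""
    text = text.replace("&nbsp;", " ").replace("\u00a0", " ")  # nbsp -> space
    text = text.replace("\u3000", " ")          # full-width -> half-width space
    text = text.translate(_ZERO_WIDTH)          # remove zero-width characters
    text = "\n".join(line.rstrip() for line in text.split("\n"))  # trim line ends
    return re.sub(r"\n{4,}", "\n\n\n", text)    # keep at most two blank lines
```

Trailing whitespace is trimmed before blank lines are collapsed, so lines that contain only spaces count as blank.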

5. Metadata Extraction

Generates YAML frontmatter:

---
title: "Article Title"
author: "Account Name"
date: "2026-04-08"
source: "https://mp.weixin.qq.com/s/xxxxx"
description: "Article description if available"
---
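
Producing frontmatter of this shape takes only a few lines. The sketch below assumes the metadata dictionary already uses the frontmatter key names shown above; `build_frontmatter` is a hypothetical helper. It quotes values with `json.dumps`, relying on JSON strings being valid YAML double-quoted scalars.

```python
import json
from typing import Mapping

FRONTMATTER_KEYS = ("title", "author", "date", "source", "description")


def build_frontmatter(metadata: Mapping[str, str]) -> str:
    """Render the known keys as a YAML frontmatter block, skipping empty values."""
    lines = ["---"]
    for key in FRONTMATTER_KEYS:
        value = metadata.get(key)
        if value:
            # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX.
            lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```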

6. Image Handling

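  • Automatic download: downloads all article images into the images/ subdirectory
  • Concurrent download: 5 concurrent threads by default, with retry support (2 retries by default)
  • Format detection: detects the image format from the URL and the Content-Type header
  • Link replacement: replaces remote URLs in the Markdown with local relative paths (images/img_000.png)
  • URL variant handling: handles variants of a WeChat image URL that differ only in query parameters
  • Failure fallback: keeps the original remote URL when a download fails
  • File validation: checks the downloaded file size (files smaller than 100 bytes are discarded as corrupt)

Image download enhancements: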

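# Download images and get the URL-to-path mapping
from pathlib import Path

from wechat_to_md import download_images, replace_image_urls

# Download the images
url_to_local = download_images(
    img_urls=["https://mmbiz.qpic.cn/..."],
    output_dir=Path("./output"),
    concurrency=5,    # Number of concurrent downloads
    timeout=30,       # Timeout in seconds
    retries=2         # Number of retries
)

# Replace the image links in the Markdown (markdown is the string from convert_simple)
md = replace_image_urls(markdown, url_to_local)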
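
Format detection from the URL and the Content-Type header could be sketched as below. This is illustrative only: `detect_extension` and `local_name` are hypothetical names, the `wx_fmt` query parameter is an assumption about how WeChat image URLs signal their format, and the `jpg` fallback is an arbitrary choice.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

_KNOWN = {
    "png": "png", "jpg": "jpg", "jpeg": "jpg", "gif": "gif",
    "webp": "webp", "svg": "svg", "svg+xml": "svg",
}


def detect_extension(url: str, content_type: Optional[str] = None,
                     default: str = "jpg") -> str:
    """Guess a file extension from the URL first, then from Content-Type."""
    parsed = urlparse(url)
    # 1. Assumed WeChat convention: ?wx_fmt=png in the query string.
    fmt = parse_qs(parsed.query).get("wx_fmt", [""])[0].lower()
    if fmt in _KNOWN:
        return _KNOWN[fmt]
    # 2. A conventional file suffix at the end of the URL path.
    suffix = parsed.path.rsplit(".", 1)[-1].lower() if "." in parsed.path else ""
    if suffix in _KNOWN:
        return _KNOWN[suffix]
    # 3. The subtype of a Content-Type header such as "image/webp; q=1".
    if content_type:
        subtype = content_type.split(";")[0].strip().lower().partition("/")[2]
        if subtype in _KNOWN:
            return _KNOWN[subtype]
    return default


def local_name(index: int, ext: str) -> str:
    """Build a relative path in the images/img_000.png style shown above."""
    return f"images/img_{index:03d}.{ext}"
```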

Error Handling

| Error | Cause | Resolution |
|---|---|---|
| NetworkError | HTTP failure, timeout, 404 | Retries 3x with exponential backoff |
| CaptchaError | Captcha page detected | Inform user to wait and retry |
| ParseError | Content element not found | Check URL validity, may be restricted article |
| Missing dependencies | pip install not run | Install: pip install requests beautifulsoup4 markdownify |
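
Retrying with exponential backoff, as listed for NetworkError, generally follows the pattern below. This is a generic sketch, not the script's code: `with_retries`, the 1-second base delay, and reading "3x" as three retries after the first attempt are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on failure wait base_delay * 2**n and try again."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            # Illustration only: real code would catch network errors specifically.
            if attempt >= retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
            attempt += 1
```

The `sleep` parameter exists so that the delay can be replaced in tests; production callers would leave it at the default.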

Important Notes

  • Only supports mp.weixin.qq.com domain articles
  • Some code blocks are rendered as images/SVG — their source code cannot be extracted
  • Captcha pages may appear under high-frequency access; wait and retry
  • Public articles only — login-gated articles cannot be fetched
  • Respect original author copyright; for personal study/archiving use only
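
Because only the mp.weixin.qq.com domain is supported, a caller may want to validate URLs before invoking the script. A minimal sketch, with `is_wechat_article_url` as a hypothetical helper name:

```python
from urllib.parse import urlparse


def is_wechat_article_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is exactly mp.weixin.qq.com."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and host == "mp.weixin.qq.com"
```

Comparing the parsed host, rather than searching the raw string, rejects look-alike URLs that merely contain the domain in their path or as a prefix of another host.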

References

For detailed WeChat article DOM structure, selectors, and element handling, refer to:

  • references/wechat-dom-reference.md — Complete WeChat DOM structure documentation
