Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

WeChat Work Doc Fetcher

v1.0.0

Fetch and convert WeChat Work developer docs pages into clean Markdown files for use in Obsidian, handling SPA content and required authentication.

0 stars · 615 downloads · 4 current · 4 all-time

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for mouzhi/wecom-doc-fetcher.

Prompt preview: Install & Setup
Install the skill "WeChat Work Doc Fetcher" (mouzhi/wecom-doc-fetcher) from ClawHub.
Skill page: https://clawhub.ai/mouzhi/wecom-doc-fetcher
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Canonical install target

openclaw skills install mouzhi/wecom-doc-fetcher

ClawHub CLI

Package manager switcher

npx clawhub@latest install wecom-doc-fetcher
Security Scan
VirusTotal
Suspicious
View report →
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The code and SKILL.md align with the stated purpose: they fetch content_md from developer.work.weixin.qq.com and clean it for Obsidian. Requiring a session cookie for authenticated pages is expected. However, the README/SKILL.md claim that Playwright 'obtains session cookies automatically — no manual cookie setup needed' is misleading: get_doc_id_via_playwright only extracts doc_id and does not transfer Playwright/browser cookies into the requests.Session used for the actual API POST.
Instruction Scope
Instructions ask users to install Playwright/Chromium and optionally paste browser cookies. The runtime SKILL.md implies Playwright will both find doc_id and handle authentication automatically; the script only uses Playwright to intercept the XHR and extract doc_id. After that, the requests.Session uses COOKIES_RAW or --cookies. This mismatch could lead users to believe no manual cookie handling is needed and either share cookies unnecessarily or fail to get content_md unexpectedly.
Install Mechanism
This is an instruction-only skill (no automated install spec). SKILL.md instructs users to pip install playwright and run `playwright install chromium`, which will download a ~150 MB headless Chromium binary from Playwright's release infrastructure. That download is large but expected for browser automation; there is no hidden or unusual external installer in the skill bundle itself.
Credentials
The skill declares no required env vars or credentials in registry metadata, which matches the code. However the tool requires session cookies for authenticated API access; those are sensitive (session id / JWT) and the script provides a COOKIES_RAW variable and a --cookies flag to accept them. Requiring cookies is proportionate to the task, but handing them to the script is a sensitive operation and should be done deliberately.
Persistence & Privilege
The skill does not request permanent inclusion, does not modify other skills or system configuration, and does not persist beyond writing the requested markdown file. It runs as an on-demand script and does not elevate privileges.
What to consider before installing
Key points to consider before installing/using:

- The tool needs an authenticated session cookie to fetch protected pages. The SKILL.md's wording that Playwright 'gets session cookies automatically' is misleading: the script uses Playwright only to extract doc_id and does not transfer browser cookies into the requests.Session. You will usually need to supply cookies via --cookies or by editing COOKIES_RAW. Treat those cookies like passwords: paste them into the script only on machines you trust, and consider revoking the session after use.
- Playwright requires installing a headless Chromium (~150 MB). Install it only if you accept that download and are willing to run browser automation locally.
- The script only contacts developer.work.weixin.qq.com (no other remote endpoints). You can verify this by reviewing the code (fetch_doc makes a single POST to the site) or by running the script in a network-monitored or isolated environment.
- If you want the advertised 'automatic' behavior (no manual cookie paste), you or the author would need to modify the script to extract cookies from Playwright and transfer them into the requests.Session before calling the API; as-is, the documentation overpromises.
- If you are uncomfortable pasting session cookies into a script, use the manual fallback to get doc_id and then query the API with a browser-exported curl command in an environment you control, or ask the author to add Playwright cookie transfer or OAuth support. Run the script in an isolated environment (container/VM) if possible.

Like a lobster shell, security has layers — review code before you run it.

latest: vk976dpmf3yep8km704r0k5waz181mtab
615 downloads · 0 stars · 1 version
Updated 3h ago
v1.0.0
MIT-0

wecom-doc-fetcher

Use this skill when the user wants to save any page from the WeChat Work (企业微信) developer documentation site (developer.work.weixin.qq.com/document/path/*) as a clean Markdown file in their Obsidian vault.

Files in this skill

wecom-doc-fetcher/
├── SKILL.md          # this file
└── wx_doc_fetch.py   # the fetch & convert script

Setup (one-time)

Run these once before using the skill:

pip install requests playwright
playwright install chromium

playwright install chromium downloads a ~150 MB headless Chromium binary. This is required for automatic doc_id detection.

Python 3.8+ is required.


Usage

Place wx_doc_fetch.py anywhere convenient (e.g. your vault's scripts folder), then run:

# Basic: auto-detect doc_id, print to stdout
python wx_doc_fetch.py <URL>

# Save to file
python wx_doc_fetch.py <URL> output.md

# Skip Playwright, supply doc_id manually
python wx_doc_fetch.py <URL> output.md --doc-id <integer>

# Override cookies at runtime
python wx_doc_fetch.py <URL> output.md --cookies "wwapidoc.sid=xxx; ..."

Example

python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md
# [info] path_id=94677  doc_id=31152
# [done] 已写入:发送消息.md   ("written: 发送消息.md")

How It Works

The WeChat Work docs site is a Vue SPA — the visible content is not in the initial HTML. It is loaded at runtime via a private POST API:

POST https://developer.work.weixin.qq.com/docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json
Body: doc_id=<integer>   (application/x-www-form-urlencoded)

The response includes data.content_md — the page content as a Markdown string. The script fetches this field, cleans it, and writes the result.
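This call can be sketched in Python with the requests library. The endpoint, parameters, and response shape are taken from this document; the function names and error handling are illustrative, not the shipped script:

```python
import requests

API = "https://developer.work.weixin.qq.com/docFetch/fetchCnt"

def fetch_content_md(doc_id: int, cookies_raw: str) -> str:
    """POST doc_id to fetchCnt and return the data.content_md field.

    cookies_raw is a browser-exported cookie string, e.g.
    "wwapidoc.sid=...; wwapidoc.token_wt=..." (placeholder values).
    """
    resp = requests.post(
        API,
        params={"lang": "zh_CN", "ajax": "1", "f": "json"},
        data={"doc_id": doc_id},                # form-urlencoded body
        headers={"Cookie": cookies_raw},        # authenticated session
        timeout=30,
    )
    resp.raise_for_status()
    return extract_content_md(resp.json())

def extract_content_md(payload: dict) -> str:
    """Pull data.content_md out of the JSON response; raise if absent."""
    data = payload.get("data") or {}
    md = data.get("content_md")
    if md is None:
        # e.g. errCode -30001 indicates a rejected session (see Cookie Configuration)
        raise RuntimeError(f"no content_md in response: errCode={payload.get('errCode')}")
    return md
```

The extraction is split out so the JSON handling can be tested without a live session.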

Why not WebFetch / defuddle?

The page renders client-side. WebFetch and defuddle only see the pre-JS HTML skeleton — no content. Scraping innerText via browser tools works but produces a very large accessibility tree with poor formatting. The content_md API field is the cleanest, most token-efficient source.

URL path ID ≠ doc_id

The number in the browser URL (e.g. 94677) is a routing slug — not the doc_id the API needs. The actual doc_id (e.g. 31152) is determined at runtime by loading the page with Playwright and intercepting the fetchCnt XHR request.
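The doc_id extraction from an intercepted request body can be sketched as a small parser; the function name is hypothetical, and the Playwright wiring in the comment is an assumption about how the script's interception might look:

```python
from typing import Optional
from urllib.parse import parse_qs

def doc_id_from_post_body(post_data: str) -> Optional[int]:
    """Extract doc_id from an intercepted fetchCnt request body.

    post_data is the form-urlencoded XHR body, e.g. "doc_id=31152".
    Returns None if the field is missing or non-numeric.
    """
    values = parse_qs(post_data).get("doc_id", [])
    return int(values[0]) if values and values[0].isdigit() else None

# With Playwright's sync API, the interception itself might look like:
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       page = p.chromium.launch(headless=True).new_page()
#       page.on("request", lambda req:
#           print(doc_id_from_post_body(req.post_data or ""))
#           if "docFetch/fetchCnt" in req.url else None)
#       page.goto(url)
```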


Manual doc_id Fallback

If Playwright is unavailable or times out:

  1. Open the target URL in Chrome
  2. DevTools → Network tab → filter by fetchCnt
  3. Click the request → Payload tab
  4. Read the doc_id value
  5. Pass it with --doc-id:
python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md --doc-id 31152

Cookie Configuration

The fetchCnt API requires an authenticated session. Playwright's headless browser obtains session cookies automatically when loading the page — no manual cookie setup needed for normal use.

If you see errCode: -30001 in the output, the session is rejected. Fix:

  1. Open the site in Chrome while logged in
  2. DevTools → Network → any fetchCnt request → Copy as cURL
  3. Find the -b '...' cookie string in the copied command
  4. Either paste it into COOKIES_RAW at the top of wx_doc_fetch.py, or pass it via --cookies "..."
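The cookie string from step 3 can be parsed into a dict for requests along these lines (a sketch; the shipped script may parse it differently). The commented-out Playwright transfer shows what true automatic cookie handling would require:

```python
def parse_cookie_header(raw: str) -> dict:
    """Turn a browser-exported cookie string ("k=v; k2=v2") into a dict
    usable with requests. Values containing '=' (e.g. JWTs) are preserved."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments without '='
            cookies[name] = value
    return cookies

# Transferring Playwright's own cookies into a requests.Session (the
# "automatic" behavior, which the current script does not do) could look like:
#
#   for c in context.cookies("https://developer.work.weixin.qq.com"):
#       session.cookies.set(c["name"], c["value"], domain=c["domain"])
```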

Key cookies and their lifetimes:

| Cookie | Purpose | Lifetime |
|---|---|---|
| wwapidoc.sid | Session identifier | ~24 hours |
| wwapidoc.token_wt | JWT auth token | ~30 minutes |

API Reference

| Item | Detail |
|---|---|
| Endpoint | `POST /docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json&random=<rand>` |
| Body | `doc_id=<integer>` (form-urlencoded) |
| Auth | Session cookies |
| Key response field | `data.content_md` |
| Other response fields | `data.content_html`, `data.content_html_v2`, `data.content_txt`, `data.title`, `data.time` |

content_md Cleaning Rules

The content_md field is mostly valid CommonMark but has site-specific issues. The clean_md() function in wx_doc_fetch.py handles all of them:

| # | Problem | Raw example | After cleaning |
|---|---|---|---|
| 1 | `[TOC]` marker at top | `[TOC]\n# 概述` | `# 概述` |
| 2 | Heading missing space after `#` | `##接口定义` | `## 接口定义` |
| 3 | Internal numeric anchor links | `[接收事件](#12977)` | `接收事件` |
| 3 | Anchors with sub-path | `[开启API](#31106/如何开启API)` | `开启API` |
| 4 | HTML line breaks inside table cells | `说明</br>补充` | `说明 补充` |
| 5 | `<b>` bold tags | `<b>注意</b>` | `**注意**` |
| 6 | `<code>` inline tags | `<code>open_kfid</code>` | `` `open_kfid` `` |
| 7 | `<font>` color tags | `<font color="red">警告</font>` | `警告` |
| 8 | `!!#rrggbb text!!` site-specific highlight | `!!#ff0000 重要!!` | `重要` |
| 9 | Leading spaces before table rows | `··\| 参数 \|` | `\| 参数 \|` |
| 10 | No blank line before table (Obsidian won't render) | `文字\n\| col \|` | `文字\n\n\| col \|` |
| 11 | Excess blank lines | 3+ `\n` in a row | 2 `\n` max |
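Several of these rules reduce to regex substitutions. The sketch below implements an illustrative subset (rules 1, 2, 5, 7, 11); it is not the shipped clean_md(), and the exact patterns the script uses may differ:

```python
import re

def clean_md_subset(md: str) -> str:
    """Illustrative subset of the content_md cleaning rules."""
    md = re.sub(r"^\[TOC\]\n?", "", md)                      # rule 1: drop [TOC]
    md = re.sub(r"^(#+)(?=[^#\s])", r"\1 ", md, flags=re.M)  # rule 2: space after #
    md = re.sub(r"<b>(.*?)</b>", r"**\1**", md)              # rule 5: <b> → **bold**
    md = re.sub(r"<font[^>]*>(.*?)</font>", r"\1", md)       # rule 7: strip <font> tags
    md = re.sub(r"\n{3,}", "\n\n", md)                       # rule 11: at most one blank line
    return md
```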

Rule 10 — critical regex note

The blank-line-before-table rule must match on lines that don't start with |, not just on the trailing character of the previous line:

# CORRECT — matches on start of line, avoids breaking table rows apart
re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)

# WRONG — table rows end with "| " (trailing space), so last char is space,
#          causing blank lines to be inserted between every table row
re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", content)
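The difference shows up on a small sample (the sample text here is illustrative):

```python
import re

sample = "文字\n| a | b |\n| c | d |\n"

# Anchored at line start: a blank line is added only before the table.
correct = re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", sample, flags=re.MULTILINE)

# Matching any char before the newline: a blank line is also inserted
# between the two table rows, breaking the table apart.
wrong = re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", sample)
```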
