微信公众号文章抓取

ReviewAudited by ClawScan on May 10, 2026.

Overview

The skill mostly matches its stated scraping/report purpose, but unsafe input handling could make it write files in unintended locations or run web/model-supplied content while creating PDFs.

Install only if you are comfortable with Playwright-based scraping of Sogou/WeChat pages and local report generation. Before use, prefer safe plain-text keywords, review articles_new.json before fetching PDFs, and update the scripts to validate URLs, escape HTML content, disable JavaScript during report rendering, and constrain all output paths to the workspace.

Findings (6)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

What this means

A malicious article could include instructions that try to distract or redirect the agent while it is summarizing.

Why it was flagged

The current session model is asked to process article text scraped from external websites and create a file used by later steps. That is purpose-aligned, but article content should be treated strictly as untrusted data.

Skill content
读取 `articles.json` 中的文章 ... 用当前模型逐篇生成100-200字摘要 ... 写入 `articles_new.json`
Recommendation

Add an explicit instruction to ignore commands inside article text, summarize only the article content, and preserve original metadata fields exactly.

What this means

If articles_new.json is corrupted or prompt-injected, the browser could be sent to non-WeChat, local, or private-network URLs and save their content as PDFs.

Why it was flagged

The fetch script trusts URLs from articles_new.json and navigates Chromium to them without scheme, domain, localhost, or file URL validation.

Skill content
ARTICLES_FILE = f'{WORKSPACE}/articles_new.json' ... url = art.get('url', '') ... await page.goto(url, wait_until='networkidle')
Recommendation

Validate that URLs are expected WeChat/Sogou HTTPS domains, reject file://, localhost, and private-network targets, and ask for confirmation before fetching unexpected URLs.

What this means

A crafted keyword containing ../ or path separators could cause report files to be written outside the intended workspace, potentially overwriting user-writable files with matching names.

Why it was flagged

The raw user-supplied keyword is used in output file paths without removing path separators or checking that the resolved path stays inside the workspace.

Skill content
keyword = sys.argv[1] ... html_file = f'{WORKSPACE}/{keyword}_行业动态.html' ... pdf_file = f'{WORKSPACE}/{keyword}_行业动态.pdf'
Recommendation

Convert the keyword to a safe filename slug, remove path separators, and verify the resolved output path remains under the intended workspace directory.

What this means

Malicious article text or a model-generated summary containing HTML or JavaScript could alter the report or execute script during PDF generation, including making network requests from the browser.

Why it was flagged

Article titles, sources, summaries, and keyword-derived content are inserted into generated HTML without HTML escaping, then opened in Chromium to produce a PDF.

Skill content
<h3>{i}. {art['title']}</h3> ... {art.get('summary', '摘要生成中...')} ... await page.goto(f'file://{abs_html}')
Recommendation

Escape all untrusted fields with html.escape(..., quote=True), disable JavaScript for the local report-rendering browser context, and sanitize summaries before writing HTML.

What this means

The target sites may treat this as automated scraping and may show CAPTCHA, block requests, or have usage-policy implications.

Why it was flagged

The scraper uses Playwright browser automation with an anti-automation-detection flag. This is related to the stated scraping purpose, but it is a behavior users should understand.

Skill content
browser = await p.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled'])
Recommendation

Use the skill only where scraping is allowed, avoid bypassing access controls, and keep manual review for CAPTCHA or blocked pages.

What this means

Installing unpinned packages or browser binaries can change over time and may behave differently depending on the package source at install time.

Why it was flagged

The dependency setup is user-directed but unpinned, and the registry has no install spec declaring or constraining these installs.

Skill content
pip install playwright requests ... playwright install chromium
Recommendation

Pin package versions, prefer a virtual environment, and document the Playwright browser install in a proper install spec.