微信公众号文章抓取
ReviewAudited by ClawScan on May 10, 2026.
Overview
The skill mostly matches its stated scraping/report purpose, but unsafe input handling could make it write files in unintended locations or run web/model-supplied content while creating PDFs.
Install only if you are comfortable with Playwright-based scraping of Sogou/WeChat pages and local report generation. Before use, prefer safe plain-text keywords, review articles_new.json before fetching PDFs, and update the scripts to validate URLs, escape HTML content, disable JavaScript during report rendering, and constrain all output paths to the workspace.
Findings (6)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
A malicious article could include instructions that try to distract or redirect the agent while it is summarizing.
The current session model is asked to process article text scraped from external websites and create a file used by later steps. That is purpose-aligned, but article content should be treated strictly as untrusted data.
读取 `articles.json` 中的文章 ... 用当前模型逐篇生成100-200字摘要 ... 写入 `articles_new.json`
Add an explicit instruction to ignore commands inside article text, summarize only the article content, and preserve original metadata fields exactly.
If articles_new.json is corrupted or prompt-injected, the browser could be sent to non-WeChat, local, or private-network URLs and save their content as PDFs.
The fetch script trusts URLs from articles_new.json and navigates Chromium to them without scheme, domain, localhost, or file URL validation.
ARTICLES_FILE = f'{WORKSPACE}/articles_new.json' ... url = art.get('url', '') ... await page.goto(url, wait_until='networkidle')Validate that URLs are expected WeChat/Sogou HTTPS domains, reject file://, localhost, and private-network targets, and ask for confirmation before fetching unexpected URLs.
A crafted keyword containing ../ or path separators could cause report files to be written outside the intended workspace, potentially overwriting user-writable files with matching names.
The raw user-supplied keyword is used in output file paths without removing path separators or checking that the resolved path stays inside the workspace.
keyword = sys.argv[1] ... html_file = f'{WORKSPACE}/{keyword}_行业动态.html' ... pdf_file = f'{WORKSPACE}/{keyword}_行业动态.pdf'Convert the keyword to a safe filename slug, remove path separators, and verify the resolved output path remains under the intended workspace directory.
Malicious article text or a model-generated summary containing HTML or JavaScript could alter the report or execute script during PDF generation, including making network requests from the browser.
Article titles, sources, summaries, and keyword-derived content are inserted into generated HTML without HTML escaping, then opened in Chromium to produce a PDF.
<h3>{i}. {art['title']}</h3> ... {art.get('summary', '摘要生成中...')} ... await page.goto(f'file://{abs_html}')Escape all untrusted fields with html.escape(..., quote=True), disable JavaScript for the local report-rendering browser context, and sanitize summaries before writing HTML.
The target sites may treat this as automated scraping and may show CAPTCHA, block requests, or have usage-policy implications.
The scraper uses Playwright browser automation with an anti-automation-detection flag. This is related to the stated scraping purpose, but it is a behavior users should understand.
browser = await p.chromium.launch(headless=False, args=['--disable-blink-features=AutomationControlled'])
Use the skill only where scraping is allowed, avoid bypassing access controls, and keep manual review for CAPTCHA or blocked pages.
Installing unpinned packages or browser binaries can change over time and may behave differently depending on the package source at install time.
The dependency setup is user-directed but unpinned, and the registry has no install spec declaring or constraining these installs.
pip install playwright requests ... playwright install chromium
Pin package versions, prefer a virtual environment, and document the Playwright browser install in a proper install spec.
