Deep Web Fetcher
Fetch and extract structured content from JS-rendered web pages, including main text, metadata, and key domain-specific metrics, without paid APIs.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description, SKILL.md, and scripts/web-fetcher.py align: the script launches Playwright, renders JS pages, runs Readability and regex extraction for metrics. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md and the script stay within scraping and extraction. The docs explicitly recommend proxies, rotating user agents, and longer delays to get past anti-bot protections. These techniques are legitimate for robust scraping but can also be misused. The skill does not read local files or environment variables beyond what a normal script would use, and it prints JSON to stdout.
Install Mechanism
There is no packaged install spec; SKILL.md instructs pip installs and running 'playwright install chromium' (standard for Playwright). No downloads from untrusted hosts or embedded binaries are present in the bundle.
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to access secrets or external auth tokens. The lack of credentials is proportionate to the stated local-scrape purpose.
Persistence & Privilege
always is false, the skill does not request persistent/system-wide changes, and the script does not modify other skills or agent configuration. It runs as a normal one-shot tool.
Assessment
This skill appears to do what it says: run headless Chromium via Playwright, extract article text and simple metrics, and print structured JSON. Before installing or running it: 1) run it in an isolated environment (virtualenv or container), because Playwright downloads browser binaries and the tool executes JS from arbitrary sites; 2) make sure you have legal permission to scrape your targets, and avoid aggressive concurrency to reduce IP blocking; 3) be cautious if you configure proxies or automation to bypass anti-bot protections (these are documented but could be abused); 4) review and pin dependency versions (playwright, readability-lxml) before running pip install; and 5) if you chain outputs to other services, remember that the SKILL.md claim that "data does not leave the machine" holds only if you do not forward the JSON elsewhere. Overall the package is internally consistent and contains no obvious covert exfiltration, but follow standard operational safety and legal and ethical scraping practices.
Like a lobster shell, security has layers: review code before you run it.
Current version: v1.0.0
Tags: fetch, latest, research, scrape, web
SKILL.md
Skill: Deep Web Fetcher
Version: 1.0.0
Description: Free web page fetching, content extraction, and structured output, with no paid API required
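Core features
- Page fetching: supports JS rendering and waits automatically for the page to load
- Main-text extraction: identifies the article body and filters out ads and navigation
- Metadata extraction: extracts the title, author, and publication date automatically
- Metric extraction: pulls key figures from the main text (sample size, AUC, cost, etc.)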
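As an illustration of the metadata feature, here is a minimal standard-library sketch. The `MetaExtractor` class, the `extract_metadata` function, and the choice of meta tags (`author`, `article:published_time`) are assumptions made for this example; the actual script may read different tags or rely on Readability instead.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title> text and a few common <meta> tags (hypothetical helper)."""

    # Assumed mapping from a meta tag's name/property to an output field.
    META_FIELDS = {
        "author": "author",
        "article:published_time": "published_date",
    }

    def __init__(self):
        super().__init__()
        self.fields = {"title": None, "author": None, "published_date": None}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            field = self.META_FIELDS.get(key)
            if field and attrs.get("content"):
                self.fields[field] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is recorded.
        if self._in_title and data.strip():
            self.fields["title"] = data.strip()


def extract_metadata(html: str) -> dict:
    """Return title, author, and published_date found in the page head."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.fields
```

Fields that are not found stay `None`, which lines up with nullable fields in the JSON output.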
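Trigger command
/web-fetcher <url> [--domain <domain>]
Parameters
| Parameter | Default | Description |
|---|---|---|
| url | required | Target page URL |
| --domain | general | Research domain; determines which metric-extraction rules apply |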
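The parameters above could be parsed with `argparse` roughly as follows. This is a sketch rather than the script's actual code: `build_parser` is a hypothetical name, and restricting `--domain` with `choices` is an assumption based on the domain names this document lists.

```python
import argparse

# Domain names as documented for this skill.
DOMAINS = ["general", "healthcare", "medical", "insurance", "machine_learning"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: web-fetcher <url> [--domain <domain>]."""
    parser = argparse.ArgumentParser(prog="web-fetcher")
    parser.add_argument("url", help="target page URL (required)")
    parser.add_argument(
        "--domain",
        choices=DOMAINS,      # assumption: unknown domains are rejected
        default="general",
        help="research domain; selects the metric-extraction rules",
    )
    return parser
```

With `choices`, a misspelled domain fails at parse time instead of silently falling back to the general rules.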
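Domain options
- general: general-purpose extraction
- healthcare: healthcare and health
- medical: medical research
- insurance: insurance cost control
- machine_learning: machine learning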
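To show how a domain could select metric-extraction rules, here is a small regex-based sketch. The rule tables and patterns are hypothetical and deliberately simple; the script's real patterns are not shown in this document and are likely broader.

```python
import re

# Hypothetical rules: metric name -> pattern with exactly one capture group.
COMMON_RULES = {
    "sample_size": re.compile(r"\b[nN]\s*=\s*(\d{1,3}(?:,\d{3})+|\d+)"),
}
DOMAIN_RULES = {
    "general": {},
    "machine_learning": {
        "auc": re.compile(r"\bAUC\b[^0-9]{0,20}(0?\.\d+|1\.0+)"),
        "accuracy": re.compile(
            r"\baccuracy\b[^0-9]{0,20}(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
        ),
    },
}


def extract_metrics(text: str, domain: str = "general") -> dict:
    """Apply the common rules plus the rules for `domain` to `text`."""
    rules = {**COMMON_RULES, **DOMAIN_RULES.get(domain, {})}
    metrics = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        if not match:
            continue
        value = match.group(1)
        # Keep sample sizes as written (e.g. "9,080"); parse the rest as floats.
        metrics[name] = value if name == "sample_size" else float(value)
    return metrics
```

Domains without an entry in `DOMAIN_RULES` fall back to the common rules only, so adding a domain means adding one dictionary entry.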
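Execution flow
1. Launch a Playwright browser
2. Open the target URL and wait for JS rendering to finish
3. Extract the main text with Readability
4. Extract metadata (title, author, date)
5. Extract key metrics according to the domain rules
6. Generate and output the JSON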
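The flow above can be sketched as follows, assuming `playwright` and `readability-lxml` are installed. The function names are hypothetical, `wait_until="networkidle"` is one possible way to wait for JS rendering (the script may use another), and steps 4 and 5 are left out to keep the sketch short. The render and extract steps are passed in as arguments so the flow can be exercised without a browser.

```python
import re


def render_page(url: str, timeout_ms: int = 30_000) -> str:
    """Steps 1-2: return the rendered HTML of `url` from headless Chromium."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: wait until the network is idle before reading the DOM.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_article(html: str):
    """Step 3: return (title, main-content HTML) via readability-lxml."""
    from readability import Document  # imported lazily

    doc = Document(html)
    return doc.title(), doc.summary()


def fetch(url: str, render=render_page, extract=extract_article) -> dict:
    """Run render -> extract and build a result dict (step 6)."""
    result = {"url": url, "success": False, "title": None, "content_text": "",
              "content_html": "", "word_count": 0, "error": None}
    try:
        html = render(url)
        title, content_html = extract(html)
        # Crude tag stripping; good enough for a word count in this sketch.
        text = re.sub(r"<[^>]+>", " ", content_html)
        text = re.sub(r"\s+", " ", text).strip()
        result.update(success=True, title=title, content_html=content_html,
                      content_text=text, word_count=len(text.split()))
    except Exception as exc:
        # Report the failure in the result instead of crashing.
        result["error"] = str(exc)
    return result
```

Calling `fetch(url)` with the defaults launches a real browser; passing stand-in `render` and `extract` functions checks the surrounding logic offline.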
Output format
{
  "url": "https://example.com/article",
  "success": true,
  "title": "Article title",
  "author": "Author name",
  "published_date": "2024-01-15",
  "content_text": "Main text...",
  "content_html": "<html>...</html>",
  "word_count": 1500,
  "extracted_metrics": {
    "sample_size": "9,080",
    "auc": 0.85,
    "accuracy": 92.5
  },
  "error": null
}
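One way to model this output in Python is a dataclass serialized with `json.dumps`. The class name `FetchResult` and its default values are assumptions for illustration; only the field names come from the format above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FetchResult:
    """Mirror of the documented JSON fields (defaults are assumptions)."""
    url: str
    success: bool = False
    title: Optional[str] = None
    author: Optional[str] = None
    published_date: Optional[str] = None
    content_text: str = ""
    content_html: str = ""
    word_count: int = 0
    extracted_metrics: dict = field(default_factory=dict)
    error: Optional[str] = None

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese titles) readable.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

A failed fetch can then be reported in the same shape, for example `FetchResult(url=url, error="timeout")`, so downstream tools only need to check `success`.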
Usage examples
Fetch an arXiv paper
/web-fetcher "https://arxiv.org/abs/2301.12345" --domain "machine_learning"
Fetch a PubMed abstract
/web-fetcher "https://pubmed.ncbi.nlm.nih.gov/38134648/" --domain "medical"
Fetch a government report
/web-fetcher "https://www.gov.cn/zhengce/zhengceku/2024-01/15/content_6923456.htm" --domain "insurance"
Installing dependencies
# Install Python dependencies
pip install playwright readability-lxml lxml beautifulsoup4
# Install the browser driver (the first run downloads about 100 MB)
playwright install chromium
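Notes
Anti-scraping measures
Some sites have anti-scraping mechanisms. If a fetch fails, you can:
- Increase the delay: adjust time.sleep() in the script
- Use a proxy: add a proxy in browser.new_context()
- Rotate the UA: change the user_agent parameter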
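The three adjustments above might look like this in code. The helper names and the user-agent strings are placeholders introduced for this sketch; `user_agent` and `proxy` are keyword arguments that Playwright's `browser.new_context()` accepts.

```python
import random
import time
from typing import Optional

# Placeholder UA strings for illustration; not taken from the script.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_context_options(proxy_server: Optional[str] = None) -> dict:
    """Build keyword arguments for browser.new_context()."""
    options = {"user_agent": random.choice(USER_AGENTS)}
    if proxy_server:
        # Playwright expects a dict with at least a "server" key.
        options["proxy"] = {"server": proxy_server}
    return options


def polite_sleep(base_seconds: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep for base_seconds plus random jitter; return the actual delay."""
    delay = base_seconds + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

In the fetch code this would be used as `context = browser.new_context(**build_context_options())`, with `polite_sleep()` called between requests to the same site.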
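Extraction accuracy
- Standard pages (articles/blogs): ✅ excellent results
- Complex layouts (multi-column/dynamically loaded): ⚠️ may need manual review
- PDF pages: ❌ not supported; use a dedicated PDF tool
Execution speed
- Single page: 5-15 seconds (including browser startup)
- Batch fetching: 3-5 concurrent fetches recommended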
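For batch fetching at the recommended concurrency, a bounded thread pool is one simple option. This is a sketch: `fetch_many` is a hypothetical helper, and `fetch_one` stands in for whatever function fetches a single URL. If `fetch_one` uses Playwright's sync API, each call should start its own Playwright instance, since those objects should not be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch_one, max_workers: int = 3) -> list:
    """Fetch URLs with at most `max_workers` in flight; results keep input order."""

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:
            # One failing page should not abort the whole batch.
            return {"url": url, "success": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

Keeping `max_workers` in the 3-5 range limits both memory use (one Chromium per worker) and load on the target sites.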
Integration with Deep Research v6.0
# Generate a card
/web-fetcher <url> --domain "insurance" > sources/card-xxx.json
# Convert the card format
python3 scripts/convert-to-card.py sources/card-xxx.json
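Because the redirect writes a file whether or not the fetch succeeded, it may help to check the saved JSON before converting it. The helper below is a hypothetical pre-check; it only relies on the `success`, `url`, and `error` fields of the output format and assumes nothing about what `convert-to-card.py` expects.

```python
import json
from pathlib import Path


def load_card_source(path) -> dict:
    """Load a saved fetch result and raise if the fetch did not succeed."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not data.get("success"):
        raise ValueError(
            f"fetch failed for {data.get('url')}: {data.get('error')}"
        )
    return data
```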
File structure
skills/web-fetcher/
├── SKILL.md
└── scripts/
    └── web-fetcher.py
Version history
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-19 | Initial release |
Completely free, runs locally, and data never leaves the machine
