Xianyu Data Grabber
A Xianyu (闲鱼) data-scraping skill. Uses Playwright + OCR to get past anti-bot measures and capture product data (title, price, "wants" count, etc.), then automatically uploads screenshots and data to a Gitee repository. Supports batch keyword search, competitor analysis, and market research.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw verdict: Suspicious (medium confidence)
Purpose & Capability
The name and description match the code: the Playwright, Tesseract OCR, and uploader scripts implement scraping, OCR, reporting, and Gitee upload. One minor metadata inconsistency: the registry's 'Required binaries' field shows placeholders ([object Object]), while SKILL.md and the code clearly require node, python3, tesseract, and playwright.
Instruction Scope
SKILL.md and the scripts instruct the agent/user to store a Gitee personal access token and an optional 闲鱼 login cookie in a local config file, and to run scripts that launch headless browsers, take full-page screenshots, perform OCR, and (optionally) upload data and screenshots to Gitee. Those actions are expected for the stated purpose, but the instructions also:
- ask the user to place sensitive tokens/cookies in a file in the home workspace, with no permission hardening enforced by install.sh;
- include explicit commands to configure cron entries for repeated autonomous scraping and upload;
- suggest (in INSTALL.md) piping a remote install.sh via curl | bash, a high-risk pattern.
The skill does not instruct reading system credentials or unrelated config files, but the automatic scheduling and remote-install suggestions broaden the operational scope and risk.
Install Mechanism
The package contains an install.sh that performs system package installs (apt-get/yum), pip/npm installs, and npx playwright install, and that attempts to install cron jobs into the system crontab. INSTALL.md also advertises a curl -sL https://raw.githubusercontent.com/your-username/xianyu-data-grabber/main/install.sh | bash pattern. Although raw.githubusercontent.com is a common release host, piping an external script to bash is risky unless the URL is a verified official release, and here the URL uses the placeholder 'your-username', which is ambiguous. The included install.sh will modify system state (packages, crontab) and requires elevated permissions to succeed; review the script carefully before execution.
Credentials
The skill requests a Gitee token (for repository file create/update) and an optional site login cookie (to improve scraping success). Those credentials are proportionate to the claimed functionality. However, the cookie and token are sensitive; the skill's docs claim storing them with permission 600, but the shipped install.sh does not explicitly set secure permissions on the config file. Also the registry metadata listing of required env vars is malformed (placeholders), which is an inconsistency to be aware of.
Persistence & Privilege
The skill itself is not marked always:true, but the provided install scripts (install.sh, cron-setup.sh) create and install cron jobs that will run scraping, report generation and uploads on a schedule (daily/weekly). That gives the skill ongoing system presence and the ability to perform repeated network access and uploads. This level of persistence is expected for a scheduler-based scraper but is a material privilege: if you install, the system will run periodic automated scraping and (if configured) upload to Gitee without further prompts. Review/consent should be explicit before enabling.
What to consider before installing
Key things to consider before installing or running this skill:
- Credentials: The skill asks for a Gitee personal access token and (optionally) a site login cookie. Only provide a token with the minimal scopes required (projects/repo write) and consider creating a dedicated repository/account for uploads. Avoid providing long-lived credentials you use elsewhere.
- Inspect install scripts: Do not run curl | bash on an untrusted URL. Inspect install.sh, uploader.sh and cron-setup.sh locally before executing. The included install.sh will install system packages, pip/npm modules and attempt to add cron jobs—these require elevated privileges and change system state.
- Cron/persistence: The skill's installer will (by default) add scheduled tasks that run scraping and uploads automatically. If you do not want autonomous recurring scraping, do not install the crontab entries or remove them after install.
- Private data and file permissions: The docs claim token/cookie are stored with 600 permissions, but the install script does not enforce this. After creating the config file, set permissions (chmod 600 ~/.openclaw/workspace/.xianyu-grabber-config.json) and consider restricting who can read the workspace directory.
- Upload destination: The uploader will push screenshots and data to a Gitee repo. Confirm the uploader.sh targets your intended repo/owner and that you trust that destination. If you prefer not to upload, leave uploadToGitee=false or avoid supplying a token.
- Run in isolation for first test: If possible, run the skill in a disposable VM or container so you can observe network activity, filesystem changes, and cron modifications before trusting it on a production host.
- Review truncated/omitted files: The repository included many scripts; some files were truncated in the review. Inspect any additional scripts (uploader.sh, run.sh, update.sh) for unexpected network endpoints, hardcoded URLs, or behaviour before full deployment.
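The file-permission advice above can be checked programmatically rather than remembered. A minimal Python sketch (the config path is the one SKILL.md specifies; `harden` is a hypothetical helper name, not part of the skill):

```python
import os
import stat

CONFIG = os.path.expanduser("~/.openclaw/workspace/.xianyu-grabber-config.json")

def harden(path):
    """Restrict a secrets file to owner read/write (0600).

    Returns True if permissions had to be tightened, False if already safe."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any group/other bits set
        os.chmod(path, 0o600)
        return True
    return False
```

Calling `harden(CONFIG)` after creating the config gives the same result as the `chmod 600` command above, and can be re-run safely.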
If you want, I can: (a) summarize the contents of uploader.sh/run.sh/update.sh and any omitted files to check for hidden endpoints or unexpected behavior; (b) show the exact crontab entries and recommend safer alternatives; or (c) suggest a minimal manual installation checklist to reduce risk.
Patterns worth reviewing
These patterns may indicate risky behavior. Check the VirusTotal and OpenClaw results above for context-aware analysis before installing.
grabber-enhanced.js:196
Shell command execution detected (child_process).
grabber.js:163
Shell command execution detected (child_process).
Current version: v1.0.0
Runtime requirements
Bins: [object Object], [object Object], [object Object], [object Object]
Env: [object Object], [object Object], [object Object], [object Object]
SKILL.md
Xianyu Data Grabber Skill (xianyu-data-grabber)
Feature description
Uses Playwright + OCR to get past Xianyu's anti-scraping measures, capture product data, and automatically upload it to a Gitee repository.
Core capabilities:
- Batch keyword search (15+ keywords supported)
- Automatic screenshot capture (PNG format)
- OCR text recognition (Chinese + English)
- Product field extraction (title, price, "wants" count)
- Auto-generated analysis reports (Markdown + JSON)
- Automatic upload to a Gitee repository
When to use
Activate this skill when the user says things like:
- "Grab competitor data from Xianyu for me"
- "Research products in such-and-such category on Xianyu"
- "Analyze the pricing strategy of top Xianyu sellers"
- "Xianyu market research"
- "Scrape Xianyu product prices and sales"
- "xianyu research" / "Xianyu data grabbing"
- "See how well this item sells on Xianyu"
- "Xianyu competitor analysis"
Configuration
1. Basic config
Create ~/.openclaw/workspace/.xianyu-grabber-config.json:
{
"gitee": {
"token": "your_gitee_token",
"owner": "your_username",
"repo": "xianyu-data"
},
"xianyu": {
"cookie": "your_xianyu_cookie"
},
"grabber": {
"keywords": ["Magisk", "KernelSU", "手机维修"],
"screenshotDir": "legion/screenshots",
"dataDir": "legion/data",
"uploadToGitee": true,
"ocrLanguage": "chi_sim+eng"
}
}
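Before the first run it is worth verifying this file loads and contains the keys the scripts depend on. A sketch (the required-key list is an assumption based on the sections above, not taken from the skill's code):

```python
import json

# Keys the scripts below appear to rely on (assumed, not authoritative).
REQUIRED = {
    "gitee": ["token", "owner", "repo"],
    "grabber": ["keywords"],
}

def load_config(path):
    """Load the grabber config and list any missing required keys."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if key not in cfg.get(section, {})
    ]
    return cfg, missing
```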
2. Gitee personal access token
How to obtain:
- Log in to https://gitee.com
- Settings → Personal Access Tokens
- Create a new token (check the projects scope)
- Copy the token into the config file
3. Xianyu cookie (optional)
How to obtain:
- Log in to Xianyu in a browser
- F12 Developer Tools → Network
- Refresh the page → copy the Cookie header
Usage
Basic usage
# Scrape a single keyword
xianyu-data-grabber search "Magisk"
# Scrape multiple keywords
xianyu-data-grabber search "Magisk" "KernelSU" "root"
# Use the keyword list from the config file
xianyu-data-grabber search --config
Advanced usage
# Specify an output directory
xianyu-data-grabber search "Magisk" --output ./my-research
# Skip the Gitee upload
xianyu-data-grabber search "Magisk" --no-upload
# OCR only, no scraping (screenshots already exist)
xianyu-data-grabber ocr --input ./screenshots
# Generate a summary report
xianyu-data-grabber report --input ./data
# Upload to Gitee
xianyu-data-grabber upload --all
Invoking via the messaging skill
Grab Xianyu product data for "Magisk"
Research competitors in the Xianyu phone-repair category
Analyze the pricing strategy of Xianyu root services
Output files
Screenshots
legion/screenshots/xianyu-{keyword}.png - full-page screenshot (height may exceed 10000 px)
Data files
| File | Format | Content |
|---|---|---|
| xianyu-{keyword}.json | JSON | Raw data for a single keyword |
| xianyu-full-data.json | JSON | Aggregate of all keywords |
| xianyu-summary.md | Markdown | Summary report |
| xianyu-analysis.md | Markdown | In-depth analysis report |
Gitee repository layout
xianyu-data/
├── README.md # auto-generated description
├── data/
│   ├── xianyu-full-data.json
│   └── xianyu-{keyword}.json
├── screenshots/
│   └── xianyu-{keyword}.png
├── reports/
│   ├── xianyu-summary.md
│   └── xianyu-analysis.md
└── upload-{timestamp}.md # upload log
Core scripts
grabber.js - main scraping script
// 1. Launch a Playwright browser (headless + stealth)
// 2. Load the cookie (if any)
// 3. Search each keyword in turn
// 4. Save a screenshot
// 5. Run OCR
// 6. Extract product fields
// 7. Save JSON data
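grabber.js itself is Node code; the same flow (steps 1-4) can be sketched with Playwright's Python API. The search URL pattern, cookie name, and domain below are illustrative assumptions, not taken from the real script:

```python
from urllib.parse import quote

def search_url(keyword):
    # Hypothetical search URL -- verify against the real site before use.
    return f"https://www.goofish.com/search?q={quote(keyword)}"

def grab(keyword, cookie=None, out_dir="screenshots"):
    """Launch headless Chromium, load the cookie, open the search page,
    and save a full-page screenshot (steps 1-4 of grabber.js)."""
    from playwright.sync_api import sync_playwright  # lazy: needs playwright installed
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0.0.0 Safari/537.36"))  # real-browser UA
        if cookie:
            # Cookie name/domain are placeholders for illustration.
            context.add_cookies([{"name": "cookie2", "value": cookie,
                                  "domain": ".goofish.com", "path": "/"}])
        page = context.new_page()
        page.goto(search_url(keyword), wait_until="networkidle")
        shot = f"{out_dir}/xianyu-{keyword}.png"
        page.screenshot(path=shot, full_page=True)
        browser.close()
        return shot
```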
ocr.py - OCR script
# 1. Read the screenshot
# 2. Run Tesseract OCR
# 3. Extract price / wants count, etc.
# 4. Emit structured data
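Steps 2-3 can be sketched with pytesseract plus two regexes; the patterns assume the "¥price / N人想要" text that Xianyu renders, and are an illustration rather than the shipped ocr.py:

```python
import re

PRICE = re.compile(r"[¥￥]\s*(\d+(?:\.\d+)?)")
WANTS = re.compile(r"(\d+)\s*人想要")  # "N 人想要" = N people want this

def ocr_image(path, lang="chi_sim+eng"):
    """Step 2: run Tesseract over a screenshot (requires pytesseract + Pillow)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path), lang=lang)

def extract(ocr_text):
    """Step 3: pull prices and wants counts out of raw OCR text."""
    return {
        "prices": [float(m) for m in PRICE.findall(ocr_text)],
        "wants": [int(m) for m in WANTS.findall(ocr_text)],
    }
```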
uploader.sh - Gitee upload script
# 1. Call the Gitee API
# 2. Create/update files
# 3. Commit
# 4. Return the upload result
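uploader.sh wraps the Gitee v5 contents API; steps 1-3 might look like this in Python. Gitee expects the file body base64-encoded; note that updating an existing file is a separate PUT that additionally needs the file's current blob sha, which this sketch omits:

```python
import base64
import json
import urllib.request

API = "https://gitee.com/api/v5/repos/{owner}/{repo}/contents/{path}"

def build_payload(token, data, message):
    """Gitee wants the file content base64-encoded in the request body."""
    return {
        "access_token": token,
        "content": base64.b64encode(data).decode("ascii"),
        "message": message,
    }

def upload(token, owner, repo, path, data, message="upload via grabber"):
    """Create a new file in the repository and return Gitee's JSON response."""
    req = urllib.request.Request(
        API.format(owner=owner, repo=repo, path=path),
        data=json.dumps(build_payload(token, data, message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```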
Data format
Single product record
{
"keyword": "Magisk",
"products": [
{
"title": "Magisk 模块合集 17G 资源",
"price": "1.00",
"wants": "628 人想要",
"seller": "卖家信用优秀",
"tags": ["24h 自动发货", "包邮"]
}
],
"timestamp": "2026-03-20T06:00:00+08:00",
"screenshot": "screenshots/xianyu-Magisk.png"
}
Summary report structure
# Xianyu Research Report
## Keyword: Magisk
- Products: 19
- Price range: ¥1-50
- Top products: ...
## Keyword: KernelSU
...
## Price analysis
...
## Competitor analysis
...
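The per-keyword figures in the report (product count, price range, top item) can be derived directly from the JSON records above. A sketch, using the field names from the single-product record:

```python
import re

def wants_count(s):
    """Parse '628人想要' -> 628; tolerate missing or malformed values."""
    m = re.search(r"(\d+)", s or "")
    return int(m.group(1)) if m else 0

def summarize(products):
    """Compute the summary-report figures for one keyword."""
    prices = [float(p["price"]) for p in products if p.get("price")]
    return {
        "count": len(products),
        "price_range": (min(prices), max(prices)) if prices else None,
        "top": max(products, key=lambda p: wants_count(p.get("wants")),
                   default=None),
    }
```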
Debug guide
Log locations
| Log | File |
|---|---|
| Scraper log | logs/xianyu-grabber.log |
| OCR log | logs/xianyu-ocr.log |
| Upload log | logs/xianyu-upload.log |
| Error log | logs/xianyu-error.log |
Common issues
1. Screenshot shows an "illegal access" page
Cause: the anti-bot system detected automation
Fix:
# 1. Refresh the cookie
# 2. Slow down scraping (increase delays)
# 3. Reduce the number of concurrent keywords
2. OCR returns empty results
Cause: Tesseract is not installed or language packs are missing
Fix:
# Install Tesseract
apt-get install tesseract-ocr tesseract-ocr-chi-sim
# Verify the install
tesseract --version
tesseract --list-langs
3. Gitee upload fails
Cause: invalid token or insufficient permissions
Fix:
# 1. Check that the token is valid
curl -H "Authorization: Bearer YOUR_TOKEN" https://gitee.com/api/v5/user
# 2. Check repository permissions
# Make sure the token has the projects scope
4. Playwright browser fails to launch
Cause: missing dependencies or browser not installed
Fix:
# Install the Playwright browser
npx playwright install chromium
# Install system dependencies
apt-get install libnss3 libnspr4 libatk1.0-0 \
libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 \
libxcomposite1 libxdamage1 libxfixes3 libxrandr2 \
libgbm1 libasound2 libpango-1.0-0 libcairo2
Test commands
# Test the config
xianyu-data-grabber test-config
# Test OCR
xianyu-data-grabber test-ocr --image test.png
# Test the Gitee upload
xianyu-data-grabber test-upload --file test.txt
# Full test suite
xianyu-data-grabber test --all
Scheduled tasks
Daily scrape
# crontab -e
0 9 * * * cd ~/.openclaw/workspace && node skills/xianyu-data-grabber/grabber.js --config --upload >> logs/xianyu-cron.log 2>&1
Weekly report
0 10 * * 1 cd ~/.openclaw/workspace && node skills/xianyu-data-grabber/report.js >> logs/xianyu-report.log 2>&1
Security & privacy
Sensitive data protection
- Cookie: stored in the config file, permission 600
- Gitee token: stored in the config file, permission 600
- Data files: stored locally, never uploaded to third parties
Platform compliance
- Request rate: default 5-second interval per keyword
- User-Agent: real browser identifier
- Data use: personal research only
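The 5-second interval above can be enforced with a small pacing generator; `paced` is a hypothetical helper, and the jitter parameter is an addition to make the timing look less mechanical:

```python
import random
import time

def paced(keywords, interval=5.0, jitter=2.0):
    """Yield keywords no faster than one per `interval` seconds."""
    for i, kw in enumerate(keywords):
        if i:  # no delay before the first keyword
            time.sleep(interval + random.uniform(0, jitter))
        yield kw
```

Usage: `for kw in paced(["Magisk", "KernelSU"]): ...` — each iteration after the first waits at least the configured interval.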
Performance tuning
Batch scraping
# Parallel (faster but more detectable)
xianyu-data-grabber search --parallel 3
# Sequential (slower but safer)
xianyu-data-grabber search --sequential
Caching
- Screenshot cache: avoids re-scraping
- OCR cache: avoids re-recognition
- Data cache: 5-minute TTL
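A 5-minute data cache can be as simple as a dict of (timestamp, value) pairs. A sketch with an injectable clock so expiry is testable (this class is illustrative, not part of the skill):

```python
import time

class TTLCache:
    """Time-based cache; the default TTL matches the 5-minute window above."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[1]
        self._entries.pop(key, None)  # drop expired or absent entries
        return default
```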
Related files
- Skill file: skills/xianyu-data-grabber/SKILL.md
- Main script: skills/xianyu-data-grabber/grabber.js
- OCR script: skills/xianyu-data-grabber/ocr.py
- Upload script: skills/xianyu-data-grabber/uploader.sh
- Config file: .xianyu-grabber-config.json
Changelog
v1.0.0 (2026-03-20)
- 🎉 Initial release
- Playwright + OCR scraping
- Automatic Gitee upload
- Batch keyword support
- Auto-generated reports