Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

douyin-scraper

v1.0.0

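Douyin image-text note collection tool. Search by keyword → automatically filter to "image-text, within the past week" → take Playwright screenshots (bypassing anti-scraping measures) → recognize text in the images with Baidu OCR → output a Markdown report (with heat scores). Load this skill when the user mentions scenarios such as "抖音图文采集" (Douyin image-text collection), "抖音笔记抓取" (Douyin note scraping), "抖音爬虫" (Douyin crawler), or "抖音内容采集" (Douyin content collection).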


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for samcheng0717/douyin-scraper.

Install the skill "douyin-scraper" (samcheng0717/douyin-scraper) from ClawHub.
Skill page: https://clawhub.ai/samcheng0717/douyin-scraper
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install douyin-scraper

ClawHub CLI


npx clawhub@latest install douyin-scraper

Security Scan

VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability (flagged)
The skill's name/description (Douyin image-text scraping + OCR) aligns with the included scripts (Playwright scraping, screenshots, OCR, Markdown output). However, the registry metadata declares no required environment variables, while SKILL.md and the code require BAIDU_PADDLEOCR_TOKEN (and optionally BAIDU_PADDLEOCR_API_URL). Omitting the env requirement from the metadata is an inconsistency that reduces transparency.
Instruction Scope (flagged)
SKILL.md instructs the agent to install Playwright, create a .env with BAIDU_PADDLEOCR_TOKEN, and run the login and full_workflow scripts — which is consistent with scraping + OCR. The runtime instructions do not warn that screenshots will be uploaded to a remote HTTP API; the code base64-encodes screenshots and POSTs them (with an Authorization header) to OCR_API_URL. The default OCR_API_URL in the script is aistudio-app.com (https://r41cd0p9x7dfp1s7.aistudio-app.com/layout-parsing) rather than an obviously-official Baidu API endpoint, and the README/SKILL.md do not document this alternate endpoint or the privacy implications of uploading screenshots.
Install Mechanism
There is no install spec (instruction-only install via pip/playwright commands), so nothing arbitrary is downloaded by an installer. The install instructions require pip installing Playwright and running 'playwright install chromium' — expected for the declared functionality.
Credentials (flagged)
Requiring a Baidu PaddleOCR token is proportionate for OCR. But the registry lists no required env vars while the code requires BAIDU_PADDLEOCR_TOKEN and supports BAIDU_PADDLEOCR_API_URL. The code's default OCR_API_URL points to a non-standard domain (aistudio-app.com subdomain) which may be a third-party/proxy endpoint; this makes the token and uploaded screenshots potentially usable by that third party rather than only by an official Baidu API, which is disproportionate and not documented.
Persistence & Privilege
The skill persists a Playwright browser profile under profile/ to store login state (login.py). It does not request always:true or system-wide config changes. Headful Playwright sessions and saved browser profile are reasonable for this use-case.
What to consider before installing
This package largely does what it says (scrapes Douyin pages, screenshots content, and calls an OCR API), but there are two actionable concerns you should address before using it with sensitive data: (1) the project expects you to set BAIDU_PADDLEOCR_TOKEN (the registry omitted this requirement) — verify where that token will be used; (2) the script's default OCR_API_URL is aistudio-app.com (https://r41cd0p9x7dfp1s7.aistudio-app.com/layout-parsing), not the obvious official Baidu endpoint, meaning screenshots (base64-encoded) will be uploaded to that host by default. If you plan to use it, either (a) set BAIDU_PADDLEOCR_API_URL explicitly to the official Baidu PaddleOCR API endpoint you trust, or (b) inspect/replace the OCR POST implementation to use an OCR provider you control. Also review and be comfortable with storing a browser profile in profile/ (it contains your logged-in session) and with the legal/terms-of-service risks of scraping Douyin. If the author can confirm the aistudio-app.com URL is an official Baidu-hosted endpoint (and update docs/metadata), the concerns would be substantially reduced.
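
One way to act on option (a) is to check the configured OCR endpoint against a list of hosts you trust before any screenshot is uploaded. The sketch below is an illustration only: the helper name is_trusted_endpoint and the host ocr.example.com are hypothetical, and the skill's own code is not known to contain such a check.

```python
from urllib.parse import urlparse


def is_trusted_endpoint(url: str, trusted_hosts: set[str]) -> bool:
    """Return True only for HTTPS URLs whose host is explicitly trusted.

    Hypothetical helper: `trusted_hosts` is a list you maintain yourself,
    for example the OCR endpoint host you have verified with the provider.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme == "https" and host in trusted_hosts


# Placeholder host for illustration; substitute the endpoint you trust.
TRUSTED = {"ocr.example.com"}

# The default endpoint reported by the scan is rejected until you opt in.
default_url = "https://r41cd0p9x7dfp1s7.aistudio-app.com/layout-parsing"
default_allowed = is_trusted_endpoint(default_url, TRUSTED)
```

Calling such a check before the OCR POST makes the upload destination an explicit decision rather than a silent default.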

Like a lobster shell, security has layers — review code before you run it.

latest vk971keq5chb0sp2nztm7d37n6h83nvjx
109 downloads
0 stars
1 version
Updated 1mo ago
v1.0.0
MIT-0

douyin-scraper

Douyin image-text note collection tool. One command runs the whole pipeline: search → filter image-text notes → screenshot → OCR → Markdown report.

⚠️ Prerequisites

1. Install dependencies

pip install playwright requests python-dotenv
python -m playwright install chromium

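2. Configure the Baidu PaddleOCR token

Create a .env file in the skill directory:

BAIDU_PADDLEOCR_TOKEN=your_token

To get a token, register for free at Baidu AI Studio (百度 AI Studio); the free tier allows 10,000 calls per day.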
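
A minimal sketch of how a script might read this configuration, assuming the .env values have already been loaded into the process environment (for example with python-dotenv's load_dotenv()). The OcrConfig type and load_ocr_config function are hypothetical names used for illustration; BAIDU_PADDLEOCR_API_URL is the optional override mentioned in the security scan above.

```python
import os
from typing import Mapping, NamedTuple, Optional


class OcrConfig(NamedTuple):
    token: str
    api_url: Optional[str]  # None means "use the script's built-in default"


def load_ocr_config(env: Mapping[str, str] = os.environ) -> OcrConfig:
    """Read the OCR settings from an environment-like mapping.

    Hypothetical helper: fails early with a clear message if the token
    is missing, instead of failing later inside the OCR request.
    """
    token = env.get("BAIDU_PADDLEOCR_TOKEN", "").strip()
    if not token:
        raise RuntimeError("BAIDU_PADDLEOCR_TOKEN is not set; add it to .env")
    api_url = env.get("BAIDU_PADDLEOCR_API_URL", "").strip() or None
    return OcrConfig(token=token, api_url=api_url)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.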

3. Log in to Douyin (one time only)

python <skill_path>/scripts/login.py

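A browser window opens Douyin. Scan the QR code to log in, then close the browser. The login state is saved automatically, so you do not need to repeat this step.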


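Usage

# Collect image-text notes with the default count (OCR included)
python <skill_path>/scripts/full_workflow.py --keyword "韩国医美"

# Specify the number of notes
python <skill_path>/scripts/full_workflow.py --keyword "减肥餐" --count 5

# Skip OCR (screenshots only)
python <skill_path>/scripts/full_workflow.py --keyword "咖啡" --no-ocr

Parameter    Description                   Default
--keyword    Search keyword                required
--count      Number of notes to collect    5
--no-ocr     Skip OCR                      off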
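
The parameter table above could be expressed with argparse roughly as follows. This is a sketch built only from the documented flags and defaults; the actual parser in full_workflow.py may differ, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="full_workflow.py",
        description="Collect Douyin image-text notes and write a Markdown report.",
    )
    parser.add_argument("--keyword", required=True, help="search keyword")
    parser.add_argument("--count", type=int, default=5,
                        help="number of notes to collect")
    parser.add_argument("--no-ocr", action="store_true",
                        help="skip OCR and only take screenshots")
    return parser


# Example: equivalent to `--keyword "咖啡" --no-ocr` on the command line.
args = build_parser().parse_args(["--keyword", "咖啡", "--no-ocr"])
```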

Output

Reports are saved to output/notes_{keyword}_{timestamp}.md, and images are saved to data/images/.
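
As a sketch of the naming pattern above, the report path might be built like this. The timestamp format is not stated in the README, so the %Y%m%d_%H%M%S layout used here is an assumption, and report_path is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path


def report_path(keyword: str, now: datetime,
                output_dir: Path = Path("output")) -> Path:
    """Return output/notes_{keyword}_{timestamp}.md for the given moment.

    Assumption: the timestamp layout (YYYYMMDD_HHMMSS) is a guess; the
    real script may format it differently.
    """
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return output_dir / f"notes_{keyword}_{stamp}.md"


path = report_path("咖啡", datetime(2024, 1, 2, 3, 4, 5))
```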

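Each note includes:

  • 🔥 Heat score (likes / days since posting), with the formula shown
  • 👍 Like count, publish time, author, and link to the original post
  • 📝 Original description
  • 🔍 Text recognized from the images by OCR (multiple images supported)
  • 🖼️ Local screenshot paths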
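
The heat score in the list above is simple arithmetic, sketched below. The README gives only likes / days_ago; clamping days_ago to at least 1 (so a note posted today does not divide by zero) is an assumption of this sketch, as are the function name and the sample numbers.

```python
def heat_score(likes: int, days_ago: int) -> float:
    """Likes per day since posting.

    Assumption: days_ago is clamped to a minimum of 1 to avoid dividing
    by zero for same-day notes; the real script may handle this differently.
    """
    return likes / max(days_ago, 1)


# Made-up sample notes: (title, likes, days_ago)
notes = [("A", 700, 7), ("B", 300, 1), ("C", 50, 0)]

# Hottest first: newer notes with the same likes rank higher.
ranked = sorted(notes, key=lambda n: heat_score(n[1], n[2]), reverse=True)
```

With these sample values, note B (300 likes in 1 day) outranks note A (700 likes over 7 days).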

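Technical features

  • Playwright screenshots: captures content images with element.screenshot(), bypassing Douyin's anti-scraping protection on image URLs
  • Image-text filtering: automatically detects and skips videos, collecting only notes of the "image-text" (图文) type
  • OCR noise filtering: automatically removes Douyin navigation-bar text from the screenshots (精选 "Featured" / 推荐 "Recommended" / 关注 "Following", etc.)
  • Multi-image support: for a note with several images, each image is screenshotted and run through OCR, and the results are merged
  • Anti-detection: headed browser (headless=False) plus human-like pacing, to avoid triggering CAPTCHAs
  • Heat formula: likes / days_ago, so newer and more-liked notes rank higher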
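
The OCR noise filtering described above could look roughly like the sketch below. Only the three navigation labels named in the README are included; the real list is described as longer ("etc."), and exact-match filtering on whole lines is an assumption about how the script works.

```python
# Navigation-bar labels named in the README; the real script likely has more.
NAV_NOISE = {"精选", "推荐", "关注"}


def filter_ocr_lines(lines: list[str]) -> list[str]:
    """Drop empty lines and lines that are exactly a known navigation label.

    Assumption: noise is matched per line after trimming whitespace.
    """
    cleaned = []
    for line in lines:
        text = line.strip()
        if text and text not in NAV_NOISE:
            cleaned.append(text)
    return cleaned


sample = ["精选", " 推荐 ", "今日咖啡配方", "", "关注"]
kept = filter_ocr_lines(sample)
```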

Directory structure

douyin-scraper/
├── scripts/
│   ├── full_workflow.py   # Main pipeline
│   └── login.py           # Login script
├── data/
│   └── images/            # Screenshots
├── output/                # Markdown reports
├── profile/               # Browser login state
└── .env                   # Token configuration
