Social Media Data Collector

Multi-platform social media data collection and aggregation for content performance tracking. Use when: (1) collecting engagement metrics (views/likes/comments/shares) across multiple platforms, (2) filling a bitable/spreadsheet with social media performance data, (3) tracking content distribution results across 10+ platforms, (4) scraping platforms that lack public APIs. Covers: Douyin, Weibo, Kuaishou, Bilibili, Toutiao, Xiaohongshu, WeChat Video (视频号), Autohome (汽车之家), Yiche (易车), Baijiahao (百家号), Douyu (斗鱼), Pipixia (皮皮虾), Dongchedi (懂车帝), TikTok, YouTube. NOT for: posting content, account management, or social listening/monitoring.

Audits: Pass

Install

openclaw skills install social-media-data-collector

Overview

Collects engagement metrics from 13+ platforms and aggregates them into a structured target (Feishu bitable 多维表格 or CSV). Three-tier approach: API first → browser-scrape fallback → manual flag.

Execution Flow

  1. Classify platforms by data access method (see references/platform-guide.md)
  2. API tier — call APIs for platforms with programmatic access
  3. Browser tier — Playwright render + text extraction for remaining
  4. Aggregate — normalize data, write to target (bitable/CSV)
  5. Cleanup — remove screenshots, temp files, browser cache
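The flow above can be sketched as a thin orchestrator. The platform sets mirror the tiers table below, but the function name and return shape are illustrative, not the actual script API:

```python
# Illustrative step-1 classifier. The real entry points are
# scripts/collect_api.py and scripts/collect_browser.py; this only
# shows how platforms split between the two tiers.
API_TIER = {"抖音", "微博", "快手", "B站", "今日头条", "小红书"}

def classify(platforms):
    """Step 1: split platforms by data-access method."""
    api = [p for p in platforms if p in API_TIER]
    browser = [p for p in platforms if p not in API_TIER]
    return api, browser
```

The API-tier list then feeds step 2 and the remainder feeds step 3.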

Platform Tiers

| Tier | Platforms | Method |
| --- | --- | --- |
| API-first | 抖音, 微博, 快手, B站, 今日头条, 小红书 | TikHub API / BlueAI Crawler |
| Browser-scrape | 百家号, 汽车之家, 易车, 视频号, 斗鱼, 皮皮虾 | Playwright headless |
| API+scrape | 懂车帝 | TikHub (limited) + scrape |
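For the API+scrape tier (懂车帝), one way to combine the two sources is to let the limited TikHub fields win wherever the API actually returned data. A sketch, with hypothetical field names:

```python
def merge_sources(api_data: dict, scraped: dict) -> dict:
    """Prefer API-sourced fields; fall back to browser-scraped values
    for any metric the limited API left empty (None)."""
    merged = dict(scraped)
    merged.update({k: v for k, v in api_data.items() if v is not None})
    return merged
```

For example, `merge_sources({"likes": 1200, "views": None}, {"likes": 1100, "views": 98000})` keeps the API likes count and fills views from the scrape.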

Model Strategy (Token Optimization)

Problem

Using opus/sonnet for the entire pipeline wastes tokens on mechanical tasks.

Recommended Model Split

| Phase | Model | Why |
| --- | --- | --- |
| Planning & classification | opus/sonnet | Needs reasoning |
| API calls & JSON parsing | haiku/flash | Mechanical, no reasoning needed |
| Browser text extraction | Code (no LLM) | Pure Python, no model call |
| Data normalization | haiku/flash | Simple mapping |
| Report/summary | sonnet | Needs synthesis |

Implementation

  • Use scripts/collect_api.py for API tier — zero LLM tokens (pure code)
  • Use scripts/collect_browser.py for browser tier — zero LLM tokens (pure code)
  • Only invoke LLM for: planning which platforms to hit, handling errors, writing summaries

Token Budget Estimate (per 13-platform run)

  • With current approach (all-opus): ~80k tokens
  • With optimized approach (code scripts + haiku routing): ~5k tokens
  • Savings: 94%

Key Commands

# Full collection run
python3 scripts/collect_api.py --config /tmp/sm-collect/config.json

# Browser scrape specific platforms  
python3 scripts/collect_browser.py --platforms "百家号,汽车之家,视频号"

# Write to bitable
python3 scripts/write_bitable.py --app-token XXX --table-id YYY --data /tmp/sm-collect/results.json

# Cleanup
rm -rf /tmp/sm-collect/ /tmp/screenshots/
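The config file consumed by collect_api.py is not documented here; the snippet below writes a plausible shape to the path the commands above expect. Every field name in it is an assumption — the authoritative schema lives with the script:

```python
import json
import pathlib

# Hypothetical config.json shape for collect_api.py; these field
# names are illustrative only, not the script's real schema.
config = {
    "platforms": ["抖音", "微博", "快手"],
    "date": "2026.5.15",
    "output": "/tmp/sm-collect/results.json",
}

out_dir = pathlib.Path("/tmp/sm-collect")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "config.json").write_text(
    json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8"
)
```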

Bitable Field Mapping

| 多维表格字段 (field) | 类型 (type) | 说明 (notes) |
| --- | --- | --- |
| 播放量 (views) | text | text with a "万" suffix |
| 点赞 (likes) | number | plain number |
| 评论 (comments) | number | plain number |
| 分享 (shares) | number | plain number |
| 收藏 (favorites) | number | plain number |
| 互动量合计 (total engagement) | text | text with a "万" suffix |
| 数据统计日期 (stats date) | text | format "2026.5.15" |

⚠️ Note: 播放量 and 互动量合计 are text fields, not number fields! Writing a raw number to them fails with TextFieldConvFail.
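To avoid the TextFieldConvFail, numeric totals must be rendered as strings before writing. A minimal formatter — the 10,000 threshold and one-decimal style are assumptions about how the sheet displays counts:

```python
def format_wan(value: int) -> str:
    """Render a raw count as the text 播放量 / 互动量合计 expect:
    counts >= 10,000 get a 万 (10k) suffix, smaller counts stay plain."""
    if value >= 10_000:
        # One decimal place, with a trailing ".0" dropped: 20000 -> "2万"
        return f"{value / 10_000:.1f}万".replace(".0万", "万")
    return str(value)
```

So `format_wan(123456)` gives `"12.3万"` while the number-typed fields (点赞 etc.) are written as plain integers.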

Cleanup Protocol

After each collection run, delete:

  • /tmp/sm-collect/ (intermediate JSON)
  • /tmp/screenshots/ (browser screenshots)
  • /tmp/subagent-out/ (if spawned sub-agents)
  • Any .json temp files in workspace

Error Handling

  • API 403/401 → token expired, refresh and retry once
  • Browser timeout → increase to 25s, retry with wait_until="domcontentloaded"
  • Platform redirects → check URL is correct (易车 hao vs sv domain!)
  • Empty data → flag for manual check, don't guess
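The token-refresh rule above can be wrapped in a small helper. `PermissionError` stands in for an HTTP 403/401, and both callables are hypothetical:

```python
def with_retry(fetch, refresh_token, max_retries=1):
    """Call fetch(); on an auth failure (403/401, modeled here as
    PermissionError) refresh the token and retry once, then give up."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except PermissionError:
            if attempt == max_retries:
                raise
            refresh_token()
```

The same wrapper shape extends to the browser-timeout case by catching a timeout error and retrying with a longer wait instead of refreshing a token.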

Platform-Specific Notes

See references/platform-guide.md for detailed per-platform experience including:

  • Authentication requirements
  • URL patterns and gotchas
  • Data extraction selectors
  • Known limitations