Install
openclaw skills install @kooui/x-tweet-scraper
Reliably scrape X.com (Twitter) user tweets using the GraphQL API with cookie-based auth. This skill documents the pitfalls encountered during real-world usage (GraphQL naming, API structure changes, rate limiting, paid-content handling) and provides working Python and Playwright scripts.
Prerequisites
Obtain authentication cookies from an active X.com session:
- Log in to https://x.com in your browser
- Export the cookies auth_token, ct0, twid, and kdt
- Save them as cookies.json in the working directory
Validate cookies:
- auth_token expiration: check the expirationDate field (Unix timestamp)
- ct0 expiration: typically 2-3 years from issue

# 1. Save cookies.json (see Prerequisites above)
# 2. Run the scraper
python scripts/scrape_fixed_v2.py --username aleabitoreddit --interval 20
# 3. Output: aleabitoreddit_tweets.json (all scraped tweets)
The script auto-saves every 20 tweets (configurable via --interval).
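
The cookie validation step can be sketched as a small pre-flight check. This is a minimal sketch, assuming cookies.json is a list of objects with name, value, and an optional expirationDate field (the shape a browser cookie export typically has); check_cookies is a hypothetical helper and is not part of the shipped scripts.

```python
import time

# Cookie names listed in the prerequisites.
REQUIRED = ("auth_token", "ct0", "twid", "kdt")


def check_cookies(cookies, now=None):
    """Return {name: "ok" | "expired" | "missing"} for each required cookie."""
    now = time.time() if now is None else now
    by_name = {c["name"]: c for c in cookies}
    report = {}
    for name in REQUIRED:
        cookie = by_name.get(name)
        if cookie is None:
            report[name] = "missing"
        elif "expirationDate" in cookie and cookie["expirationDate"] <= now:
            # expirationDate is a Unix timestamp in seconds
            report[name] = "expired"
        else:
            report[name] = "ok"
    return report


# In-memory sample; in practice the list would come from json.load on cookies.json.
sample = [
    {"name": "auth_token", "value": "a", "expirationDate": 2000},
    {"name": "ct0", "value": "b", "expirationDate": 500},
]
print(check_cookies(sample, now=1000))
```

Running a check like this before the scraper makes an expired or missing cookie visible up front rather than as an HTTP error halfway through a run.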
These are real errors encountered in practice. Overlooking any one of them causes the scraper to fail.
Error: API returns HTTP 422 (Unprocessable Entity)
Root cause: X.com's GraphQL API expects camelCase variable names, NOT snake_case
| Wrong ❌ | Right ✅ |
|---|---|
| {"screen_name": "xxx"} | {"screenName": "xxx"} |
| {"user_id": "xxx"} | {"userId": "xxx"} |
Fix: Always use camelCase for GraphQL variable names.
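
One way to avoid the mistake is to convert variable names just before the request is built. The sketch below only illustrates the naming rule from the table; to_camel and camelize_variables are hypothetical helpers and are not taken from scripts/scrape_fixed_v2.py.

```python
import json


def to_camel(name):
    """Convert a snake_case name to camelCase, e.g. screen_name -> screenName."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_variables(variables):
    """Return a copy of the variables dict with camelCase keys."""
    return {to_camel(key): value for key, value in variables.items()}


variables = camelize_variables({"screen_name": "xxx"})
print(json.dumps(variables))  # {"screenName": "xxx"}
```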
X.com occasionally updates their GraphQL schema. Two structures exist:
| Structure | Path to user data |
|---|---|
| Old | data.user.result |
| New | data.user_result_by_screen_name.result |
Fix: Check both paths (see scripts/scrape_fixed_v2.py for implementation).
Similarly, field names can be either camelCase or snake_case:
| Field | camelCase | snake_case |
|---|---|---|
| Entry ID | entryId | entry_id |
Fix: Check both formats in parser code.
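
Both fallbacks (response path and field name) can be sketched with two small helpers. This is an illustrative sketch built only from the paths and field names in the tables above; dig, find_user_result, and first_present are hypothetical names, and the actual implementation in scripts/scrape_fixed_v2.py may differ.

```python
def dig(obj, path):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


# Old structure first, then the new one.
USER_PATHS = ("data.user.result", "data.user_result_by_screen_name.result")


def find_user_result(payload):
    """Return the user object from whichever response structure is present."""
    for path in USER_PATHS:
        result = dig(payload, path)
        if result is not None:
            return result
    return None


def first_present(mapping, *names):
    """Return the value of the first key in names that exists in mapping."""
    for name in names:
        if name in mapping:
            return mapping[name]
    return None


new_style = {"data": {"user_result_by_screen_name": {"result": {"rest_id": "1"}}}}
print(find_user_result(new_style))
print(first_present({"entry_id": "tweet-1"}, "entryId", "entry_id"))
```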
Error: API returns HTTP 403 (Forbidden)
Root cause: Wrong value for x-csrf-token header
| Wrong ❌ | Right ✅ |
|---|---|
| Use auth_token cookie value | Use ct0 cookie value |
Fix:
ct0 = next((c['value'] for c in cookies_data if c['name'] == 'ct0'), '')
headers['x-csrf-token'] = ct0
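
A self-contained version of the same fix might look like the sketch below. It assumes cookies_data is the list loaded from cookies.json and that the cookies are also sent back in a cookie header; build_auth_headers is a hypothetical helper, and any other headers the real script sends are deliberately left out.

```python
def build_auth_headers(cookies_data):
    """Build the CSRF and cookie headers from a list of cookie dicts."""
    ct0 = next((c["value"] for c in cookies_data if c["name"] == "ct0"), "")
    if not ct0:
        # Fail early instead of sending a request that would return HTTP 403.
        raise ValueError("ct0 cookie not found; x-csrf-token cannot be set")
    cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in cookies_data)
    return {"x-csrf-token": ct0, "cookie": cookie_header}


sample = [
    {"name": "auth_token", "value": "secret"},
    {"name": "ct0", "value": "csrf123"},
]
print(build_auth_headers(sample)["x-csrf-token"])  # csrf123
```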
Error: API returns HTTP 429 (Too Many Requests)
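Solution:
- Keep a delay between requests: --delay 2 (2 seconds, default)
- If rate limiting persists, wait 15-60 minutes and increase --delay
Symptom: Some tweets return {"__typename": "TweetPreviewDisplay"} instead of {"__typename": "Tweet"}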
Cause: These are Super Follow paid-content tweets. The API cannot fetch their full text.
Fix: Skip these entries in API-based scraping. For browser automation: same limitation applies unless the logged-in account has subscribed.
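
Skipping the preview entries can be sketched as a filter on __typename. This is a minimal sketch, assuming each parsed tweet result is a dict carrying that field; drop_paid_previews is a hypothetical helper name.

```python
def drop_paid_previews(results):
    """Split out TweetPreviewDisplay entries; return (kept_results, skipped_count)."""
    kept, skipped = [], 0
    for result in results:
        if result.get("__typename") == "TweetPreviewDisplay":
            skipped += 1  # paid content: full text is not available
        else:
            kept.append(result)
    return kept, skipped


results = [
    {"__typename": "Tweet", "rest_id": "1"},
    {"__typename": "TweetPreviewDisplay"},
]
kept, skipped = drop_paid_previews(results)
print(len(kept), skipped)  # 1 1
```

Returning the skipped count makes it possible to report how many tweets were omitted instead of dropping them silently.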
Symptom: Script only fetches first page (~20 tweets), then stops
Cause: Cursor value not extracted correctly
Fix: Check multiple possible paths (see scripts/scrape_fixed_v2.py for implementation).
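
Because the exact paths are not listed here, the sketch below searches the whole response instead of hard-coding them. It rests on two assumptions that this document does not state: that the pagination entry has an entryId (or entry_id) starting with "cursor-bottom", and that the cursor string is stored under a key named "value" somewhere inside that entry. Both function names are hypothetical.

```python
def find_value(node):
    """Depth-first search for the first string stored under a 'value' key."""
    if isinstance(node, dict):
        if isinstance(node.get("value"), str):
            return node["value"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_value(child)
        if found:
            return found
    return None


def find_bottom_cursor(node):
    """Walk the response and return the cursor of the first 'cursor-bottom' entry."""
    if isinstance(node, dict):
        entry_id = node.get("entryId") or node.get("entry_id") or ""
        if isinstance(entry_id, str) and entry_id.startswith("cursor-bottom"):
            return find_value(node)
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_bottom_cursor(child)
        if found:
            return found
    return None


page = {"entries": [
    {"entryId": "tweet-1", "content": {}},
    {"entryId": "cursor-bottom-1", "content": {"value": "ABC"}},
]}
print(find_bottom_cursor(page))  # ABC
```

A scraper loop would stop when this returns None or repeats the previous cursor, since either case means there is no further page to request.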
Symptom: created_at field is empty or parsing fails
Cause: X.com API returns two time formats:
- ISO 8601: 2026-06-10T11:43:40.000Z
- Legacy format: Wed Jun 10 11:43:40 +0000 2026
Fix: Implement dual-format parser (see scripts/scrape_fixed_v2.py).
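
A dual-format parser can be sketched with datetime.strptime by trying each format in turn. This is an illustrative version, not the parse_time() from the shipped script; the third format (ISO without fractional seconds) is an added assumption that covers timestamps such as 2026-06-24T14:30:00Z.

```python
from datetime import datetime, timezone

# Formats to try in order. A literal trailing Z is treated as UTC.
ISO_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")
LEGACY_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_time(raw):
    """Parse either time format into an aware UTC datetime, or return None."""
    if not raw:
        return None
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            pass
    try:
        # %z makes the result timezone-aware; normalize it to UTC.
        return datetime.strptime(raw, LEGACY_FORMAT).astimezone(timezone.utc)
    except ValueError:
        return None


print(parse_time("2026-06-10T11:43:40.000Z"))
print(parse_time("Wed Jun 10 11:43:40 +0000 2026"))
```

Returning None instead of raising lets the caller decide whether to keep a tweet with an unparsed timestamp. Note that %a and %b match English day and month names only under the default C locale.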
Error: Playwright fails to set cookies, browser shows "Not logged in"
Cause: sameSite: "no_restriction" is invalid in Playwright
| Wrong ❌ | Right ✅ |
|---|---|
| "no_restriction" | "None" |
| "strict" | "Strict" |
| "lax" | "Lax" |
Fix: Map cookie sameSite values before passing to Playwright.
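
The mapping can be sketched as a conversion step applied to each exported cookie. The sketch is in Python for consistency with the main script, while the fallback script itself is Node.js, where the same mapping applies. It assumes the export uses expirationDate, which is copied to Playwright's expires field, and it drops sameSite when the value is not recognized; to_playwright_cookie is a hypothetical helper.

```python
# Browser-export sameSite values mapped to the values Playwright accepts.
SAME_SITE = {"no_restriction": "None", "none": "None", "strict": "Strict", "lax": "Lax"}

# Cookie fields copied through unchanged.
PASS_THROUGH = ("name", "value", "domain", "path", "httpOnly", "secure")


def to_playwright_cookie(cookie):
    """Return a new cookie dict with a Playwright-compatible sameSite value."""
    out = {key: cookie[key] for key in PASS_THROUGH if key in cookie}
    if "expirationDate" in cookie:
        out["expires"] = cookie["expirationDate"]
    same_site = SAME_SITE.get(str(cookie.get("sameSite", "")).lower())
    if same_site:
        out["sameSite"] = same_site  # unknown values are omitted entirely
    return out


exported = {"name": "ct0", "value": "x", "domain": ".x.com", "path": "/",
            "sameSite": "no_restriction", "expirationDate": 1900000000}
print(to_playwright_cookie(exported)["sameSite"])  # None
```

With Playwright's Python binding, the converted list would then be passed to BrowserContext.add_cookies.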
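scripts/scrape_fixed_v2.py (Primary)
The main scraping script. As described above, it checks both API response structures, tries multiple cursor paths for pagination, parses both time formats, and auto-saves every --interval tweets.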
Usage:
python scripts/scrape_fixed_v2.py --username TARGET_USERNAME --interval 20
scripts/scrape_with_playwright.js (Fallback)
Node.js script using Playwright. Use it as a fallback when the API-based script does not work.
Limitation: Cannot bypass Super Follow paywalls unless the logged-in account has subscribed.
Usage:
npm install playwright
npx playwright install chromium
node scripts/scrape_with_playwright.js
1. Prepare cookies.json (see Prerequisites above)
2. Run python scripts/scrape_fixed_v2.py --username TARGET
3. Read the results from TARGET_tweets.json
The X.com GraphQL API does not support date-range filtering directly. Workaround: filter the saved output locally.
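import json

# The scraper output is an object; the tweets are under the "tweets" key
with open('TARGET_tweets.json') as f:
    tweets = json.load(f)['tweets']

# Compare only the YYYY-MM-DD prefix so tweets on the end date are included
filtered = [t for t in tweets if '2026-06-21' <= t.get('created_at', '')[:10] <= '2026-06-25']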
Limitation (Important): Browser automation CANNOT bypass Super Follow paywalls.
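Workarounds:
- Subscribe to the creator with the logged-in account
- Copy the content manually
- Browser automation (scripts/scrape_with_playwright.js): same limitation applies

| Error | Cause | Solution |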
|---|---|---|
| HTTP 422 | Wrong GraphQL variable name | Use camelCase (screenName not screen_name) |
| HTTP 403 | Wrong x-csrf-token | Use ct0 cookie value, not auth_token |
| HTTP 429 | Rate limited | Wait 15-60 min, increase --delay |
| 0 tweets extracted | API structure changed | Check api_response_debug.json, adapt parser |
| Missing recent tweets | Paid content (Super Follow) | Subscribe with the logged-in account, then use browser automation or manual copy |
| created_at is empty | Time parsing failed | Check parse_time() handles both formats |
| Playwright: not logged in | Cookie sameSite format | Map no_restriction → None |
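
For the HTTP 429 row, the wait-and-retry advice can be sketched as a wrapper around a single request. This is a hedged sketch: fetch is a hypothetical callable returning (status, body), the 2-second delay and the 15-60 minute window come from this document, and doubling the wait between retries is an assumption.

```python
import time


def fetch_with_backoff(fetch, delay=2.0, max_retries=3, first_wait=900.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait 15 min, then double up to 60 min, and retry."""
    wait = first_wait
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            sleep(delay)  # normal pacing before the next request
            return status, body
        if attempt == max_retries:
            break
        sleep(wait)
        wait = min(wait * 2, 3600.0)  # cap at 60 minutes
    raise RuntimeError("still rate limited after retries")


# Demonstration with a fake fetch and a recording sleep, so nothing actually waits.
responses = iter([(429, None), (200, "ok")])
waits = []
print(fetch_with_backoff(lambda: next(responses), sleep=waits.append), waits)
```

Passing sleep as a parameter keeps the function testable without real waiting.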
The scraper outputs JSON in this format:
{
  "username": "aleabitoreddit",
  "total_tweets": 839,
  "fetch_time": "2026-06-24T14:30:00Z",
  "note": "Scraped using x-tweet-scraper skill",
  "tweets": [
    {
      "id": "1900677287501488511",
      "text": "Full tweet text here...",
      "created_at": "2026-06-24T14:30:00Z",
      "timestamp": 1782311400,
      "favorite_count": 123,
      "retweet_count": 45,
      "reply_count": 12,
      "quote_count": 8,
      "is_retweet": false,
      "is_quote": false,
      "in_reply_to_status_id": null,
      "in_reply_to_user_id": null,
      "user": {
        "screen_name": "aleabitoreddit",
        "user_id": "2007761757088194560"
      }
    }
  ]
}
- references/api_structure.md: Full documentation of X.com GraphQL API structure
- references/error_codes.md: HTTP error codes and solutions
- references/field_mapping.md: Old vs new field name mapping table