X.com Tweet Scraper

Reliably scrape X.com (Twitter) user tweets using the GraphQL API with cookie-based authentication. Documents the real-world pitfalls encountered (GraphQL naming, API structure changes, rate limiting, paid-content handling) and provides working Python and Playwright scripts.

InitialDD@kooui

Install

openclaw skills install @kooui/x-tweet-scraper

X.com Tweet Scraper Skill

English

Reliably scrape X.com tweets using the GraphQL API. This skill documents all pitfalls encountered during real-world usage and provides working scripts.

Quick Start

Prerequisites

1. Obtain authentication cookies from an active X.com session:
   - Log in to X.com in a browser
   - Press F12 → Application → Cookies → https://x.com
   - Export these critical cookies as a JSON array: auth_token, ct0, twid, kdt
   - Save as cookies.json in the working directory
2. Validate cookies:
   - auth_token expiration: check the expirationDate field (Unix timestamp)
   - ct0 expiration: typically 2-3 years from issue
   - If expired: log in to X.com again and re-export
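
The validation checks above can be automated before a long scrape. The following is a minimal sketch, assuming cookies.json is a JSON array of objects with `name`, `value`, and an optional `expirationDate` field, as browser cookie exporters typically produce. The helper names `check_cookies` and `check_cookie_file` are hypothetical and are not part of the bundled scripts.

```python
import json
import time
from pathlib import Path

# The four cookies listed in the Prerequisites.
REQUIRED_COOKIES = ("auth_token", "ct0", "twid", "kdt")


def check_cookies(cookies, now=None):
    """Return a list of problems; an empty list means the cookies look usable."""
    now = time.time() if now is None else now
    by_name = {c.get("name"): c for c in cookies}
    problems = []
    for name in REQUIRED_COOKIES:
        cookie = by_name.get(name)
        if cookie is None:
            problems.append(f"missing cookie: {name}")
            continue
        # Assumption: a cookie without expirationDate is a session cookie
        # and is treated as not expired.
        expires = cookie.get("expirationDate")
        if expires is not None and float(expires) <= now:
            problems.append(f"expired cookie: {name}")
    return problems


def check_cookie_file(path="cookies.json"):
    """Load a browser-exported cookie array from disk and validate it."""
    return check_cookies(json.loads(Path(path).read_text(encoding="utf-8")))
```

If `check_cookie_file()` returns a non-empty list, log in again and re-export before running the scraper.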

Minimal Working Example

```bash
# 1. Save cookies.json (see Prerequisites above)

# 2. Run the scraper
python scripts/scrape_fixed_v2.py --username aleabitoreddit --interval 20

# 3. Output: aleabitoreddit_tweets.json (all scraped tweets)
```

The script auto-saves every 20 tweets (configurable via --interval).

Documented Pitfalls (Critical)

These are real errors encountered during scraping. Ignoring any one of them causes the scraper to fail.

Pitfall 1: GraphQL Variable Name Format

Error: API returns HTTP 422 (Unprocessable Entity)

Root cause: X.com's GraphQL API expects camelCase variable names, NOT snake_case

| Wrong ❌ | Right ✅ |
| --- | --- |
| `{"screen_name": "xxx"}` | `{"screenName": "xxx"}` |
| `{"user_id": "xxx"}` | `{"userId": "xxx"}` |

Fix: Always use camelCase for GraphQL variable names.
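
One way to avoid slipping back into snake_case is to convert names at the point where the variables payload is built. This is a small sketch of that idea; `to_camel_case` and `build_variables` are hypothetical helpers, and the sketch only covers key naming, not which variables a given GraphQL operation actually accepts.

```python
import json
import re


def to_camel_case(name):
    """Convert snake_case to camelCase; names without underscores pass through unchanged."""
    return re.sub(r"_([a-zA-Z0-9])", lambda m: m.group(1).upper(), name)


def build_variables(**kwargs):
    """Serialize keyword arguments as a JSON variables string with camelCase keys."""
    return json.dumps({to_camel_case(key): value for key, value in kwargs.items()})


# Python code can keep its usual snake_case arguments:
variables = build_variables(screen_name="xxx")
```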

Pitfall 2: API Response Structure Changes

X.com occasionally updates their GraphQL schema. Two structures exist:

| Structure | Path to user data |
| --- | --- |
| Old | `data.user.result` |
| New | `data.user_result_by_screen_name.result` |

Fix: Check both paths (see scripts/scrape_fixed_v2.py for implementation).

Similarly, field names can be either camelCase or snake_case:

| Field | camelCase | snake_case |
| --- | --- | --- |
| Entry ID | `entryId` | `entry_id` |

Fix: Check both formats in parser code.
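
Both fixes in this pitfall come down to trying several candidate locations and taking the first one that exists. The sketch below illustrates that pattern using the two user-data paths and the two entry ID spellings listed above. It is not the implementation in scripts/scrape_fixed_v2.py, and the helper names are hypothetical.

```python
def dig(obj, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def first_present(obj, *paths):
    """Return the value at the first path that resolves to something other than None."""
    for path in paths:
        value = dig(obj, path)
        if value is not None:
            return value
    return None


def get_user_result(response):
    """Look up user data under the old path first, then the new one."""
    return first_present(
        response, "data.user.result", "data.user_result_by_screen_name.result"
    )


def get_entry_id(entry):
    """Accept either the camelCase or the snake_case spelling of the entry ID."""
    return first_present(entry, "entryId", "entry_id")
```

When a future schema change adds a third location, only the list of candidate paths needs to grow.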

Pitfall 3: x-csrf-token Header

Error: API returns HTTP 403 (Forbidden)

Root cause: Wrong value for x-csrf-token header

| Wrong ❌ | Right ✅ |
| --- | --- |
| Use `auth_token` cookie value | Use `ct0` cookie value |

Fix:

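```python
# cookies_data is the list loaded from cookies.json; headers is the request header dict.
ct0 = next((c['value'] for c in cookies_data if c['name'] == 'ct0'), '')
headers['x-csrf-token'] = ct0  # must be the ct0 value, not auth_token
```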

Pitfall 4: Rate Limiting (HTTP 429)

Error: API returns HTTP 429 (Too Many Requests)

Solution:

- Add delay between requests: --delay 2 (2 seconds, default)
- If already rate-limited: wait 15-60 minutes for reset
- Use cursor-based pagination (don't re-fetch already-scraped tweets)
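
The delay-and-wait behaviour described above can be sketched as a small wrapper around whatever function performs the request. Everything here is an assumption made for illustration: `fetch` is a hypothetical zero-argument callable returning `(status_code, payload)`, and the default wait of 900 seconds is simply the low end of the 15-60 minute window. The bundled script may handle this differently.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, delay=2.0, backoff_wait=900.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait and retry, otherwise pause briefly and return.

    fetch is a hypothetical callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            sleep(delay)  # fixed pause before the caller issues the next request
            return payload
        if attempt < max_retries:
            sleep(backoff_wait)  # rate-limited: wait for the limit to reset
    raise RuntimeError("still rate-limited after retries")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.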

Pitfall 5: Paid Content (TweetPreviewDisplay)

Symptom: Some tweets return {"__typename": "TweetPreviewDisplay"} instead of {"__typename": "Tweet"}

Cause: These are Super Follow paid-content tweets. API cannot fetch full text.

Fix: Skip these entries in API-based scraping. For browser automation: same limitation applies unless the logged-in account has subscribed.
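
Skipping these entries can be as simple as branching on `__typename`. The sketch below assumes each result is a dict carrying the `__typename` values quoted above. `split_paid_content` is a hypothetical helper, and keeping the previews in a separate list is just one option that makes it easy to report how many tweets were skipped.

```python
def split_paid_content(results):
    """Separate full tweets from paywalled previews based on __typename."""
    tweets, previews = [], []
    for item in results:
        typename = item.get("__typename")
        if typename == "TweetPreviewDisplay":
            previews.append(item)  # paid content: full text is unavailable
        elif typename == "Tweet":
            tweets.append(item)
        # Any other type is ignored in this sketch.
    return tweets, previews
```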

Pitfall 6: Cursor Extraction

Symptom: Script only fetches first page (~20 tweets), then stops

Cause: Cursor value not extracted correctly

Fix: Check multiple possible paths (see scripts/scrape_fixed_v2.py for implementation).
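
As a rough illustration of checking multiple paths, the sketch below scans timeline entries for a bottom cursor. The details are assumptions, not taken from the bundled script: it assumes cursor entries have an entry ID beginning with `cursor-bottom`, and that the cursor value sits either at `content.value` or one level deeper under `content.itemContent.value` (or its snake_case spelling). Check a saved API response before relying on these paths.

```python
def extract_bottom_cursor(entries):
    """Return the next-page cursor from a list of timeline entries, or None if absent."""
    for entry in entries:
        entry_id = entry.get("entryId") or entry.get("entry_id") or ""
        # Assumption: pagination entries are identified by this ID prefix.
        if not entry_id.startswith("cursor-bottom"):
            continue
        content = entry.get("content") or {}
        # Assumed candidate locations for the cursor value, tried in order.
        candidates = (
            content.get("value"),
            (content.get("itemContent") or {}).get("value"),
            (content.get("item_content") or {}).get("value"),
        )
        for value in candidates:
            if value:
                return value
    return None
```

A pagination loop would stop when this returns `None` or repeats the previous cursor.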

Pitfall 7: Time Parsing

Symptom: created_at field is empty or parsing fails

Cause: X.com API returns two time formats:

- ISO 8601 format: 2026-06-10T11:43:40.000Z
- Legacy Twitter format (RFC 2822-style): Wed Jun 10 11:43:40 +0000 2026

Fix: Implement dual-format parser (see scripts/scrape_fixed_v2.py).
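
A dual-format parser can try each known format in turn and fall back to `None`. This is a minimal sketch using only the standard library. `parse_tweet_time` is a hypothetical stand-in, not the function in scripts/scrape_fixed_v2.py, and the `%a`/`%b` directives assume an English (C) locale.

```python
from datetime import datetime, timezone

# Formats tried in order: ISO with fractional seconds, ISO without, legacy Twitter.
_TIME_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%a %b %d %H:%M:%S %z %Y",
)


def parse_tweet_time(raw):
    """Parse either timestamp format into an aware UTC datetime; None on failure."""
    if not raw:
        return None
    text = raw.strip()
    if text.endswith("Z"):
        text = text[:-1] + "+0000"  # rewrite the Z suffix so %z can parse it
    for fmt in _TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt).astimezone(timezone.utc)
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising lets the caller decide whether to keep a tweet with an empty created_at or log it for inspection.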

Pitfall 8: Cookie sameSite Format (Playwright Only)

Error: Playwright fails to set cookies, browser shows "Not logged in"

Cause: sameSite: "no_restriction" is invalid in Playwright

| Wrong ❌ | Right ✅ |
| --- | --- |
| `"no_restriction"` | `"None"` |
| `"strict"` | `"Strict"` |
| `"lax"` | `"Lax"` |

Fix: Map cookie sameSite values before passing to Playwright.
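
The mapping in the table can be applied to the whole cookie list in one pass. The bundled Playwright script is Node.js; the sketch below shows the same mapping in Python for consistency with the other examples. `normalize_same_site` is a hypothetical helper, and falling back to `"Lax"` for missing or unrecognized values is an assumption, not documented behaviour.

```python
# Browser-export values on the left, Playwright-accepted values on the right.
SAME_SITE_MAP = {
    "no_restriction": "None",
    "none": "None",
    "strict": "Strict",
    "lax": "Lax",
}


def normalize_same_site(cookies, default="Lax"):
    """Return copies of the cookies with sameSite rewritten to Strict, Lax, or None."""
    fixed = []
    for cookie in cookies:
        cookie = dict(cookie)  # copy so the caller's data is not mutated
        raw = str(cookie.get("sameSite") or "").lower()
        # Assumption: unknown or missing values fall back to `default`.
        cookie["sameSite"] = SAME_SITE_MAP.get(raw, default)
        fixed.append(cookie)
    return fixed
```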

Scripts Included

`scripts/scrape_fixed_v2.py` (Primary)

The main scraping script. Features:

✅ Compatible with old and new API response structures
✅ Handles both camelCase and snake_case field names
✅ Skips paid content (TweetPreviewDisplay)
✅ Auto-saves every N tweets (configurable)
✅ Cursor-based pagination
✅ Rate limit handling (with delay)

Usage:

```bash
python scripts/scrape_fixed_v2.py --username TARGET_USERNAME --interval 20
```

`scripts/scrape_with_playwright.js` (Fallback)

Node.js script using Playwright. Use when:

- The GraphQL API returns 429 (rate limited) and you need data immediately
- API structure changes break the Python script
- You want visual verification (screenshots)

Limitation: Cannot bypass Super Follow paywalls unless the logged-in account has subscribed.

Usage:

```bash
npm install playwright
npx playwright install chromium
node scripts/scrape_with_playwright.js
```

Step-by-Step Scraping Workflow

Scenario A: Scrape All Tweets from a User

1. Obtain and save cookies.json (see Prerequisites above)
2. Run: python scripts/scrape_fixed_v2.py --username TARGET
3. Wait for completion (monitor rate limits)
4. Output: TARGET_tweets.json

Scenario B: Scrape a Specific Date Range

X.com GraphQL API doesn't support date-range filtering directly. Workaround:

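1. Scrape all tweets (Scenario A)
2. Filter by date in post-processing:

```python
import json
# The output file is a JSON object; the tweet list is under the "tweets" key.
with open('TARGET_tweets.json', encoding='utf-8') as f:
    tweets = json.load(f)['tweets']
# Compare only the YYYY-MM-DD prefix so tweets posted on the end date are kept.
filtered = [t for t in tweets if '2026-06-21' <= (t.get('created_at') or '')[:10] <= '2026-06-25']
```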

Scenario C: Fetch Paid Content

Limitation (Important): Browser automation CANNOT bypass Super Follow paywalls.

- If the logged-in account has NOT subscribed: only preview text (~200 chars) is visible
- If the logged-in account HAS subscribed: full text is visible and can be scraped

Workarounds:

- Use the Playwright script (scripts/scrape_with_playwright.js); the same limitation applies
- Manually copy-paste tweet text from the browser (most reliable)
- Subscribe to the account, then re-run the script

Troubleshooting

| Error | Cause | Solution |
| --- | --- | --- |
| HTTP 422 | Wrong GraphQL variable name | Use camelCase (`screenName` not `screen_name`) |
| HTTP 403 | Wrong x-csrf-token | Use `ct0` cookie value, not `auth_token` |
| HTTP 429 | Rate limited | Wait 15-60 min, increase `--delay` |
| 0 tweets extracted | API structure changed | Check `api_response_debug.json`, adapt parser |
| Missing recent tweets | Paid content | Use browser automation with a subscribed account, or copy manually |
| `created_at` is empty | Time parsing failed | Check `parse_time()` handles both formats |
| Playwright: not logged in | Cookie sameSite format | Map `no_restriction` → `None` |

File Output Format

The scraper outputs JSON in this format:

```json
{
  "username": "aleabitoreddit",
  "total_tweets": 839,
  "fetch_time": "2026-06-24T14:30:00Z",
  "note": "Scraped using x-tweet-scraper skill",
  "tweets": [
    {
      "id": "1900677287501488511",
      "text": "Full tweet text here...",
      "created_at": "2026-06-24T14:30:00Z",
      "timestamp": 1782311400,
      "favorite_count": 123,
      "retweet_count": 45,
      "reply_count": 12,
      "quote_count": 8,
      "is_retweet": false,
      "is_quote": false,
      "in_reply_to_status_id": null,
      "in_reply_to_user_id": null,
      "user": {
        "screen_name": "aleabitoreddit",
        "user_id": "2007761757088194560"
      }
    }
  ]
}
```
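
To show how this structure might be consumed, here is a small sketch that summarizes a loaded output object. It relies only on the field names shown in the example above. `summarize` is a hypothetical helper and is not shipped with the skill.

```python
def summarize(data):
    """Return basic statistics for a scraper output object."""
    tweets = data.get("tweets", [])
    originals = [t for t in tweets if not t.get("is_retweet")]
    # Most-liked original tweet, or None when there are no originals.
    top = max(originals, key=lambda t: t.get("favorite_count", 0), default=None)
    return {
        "username": data.get("username"),
        "declared_total": data.get("total_tweets"),
        "actual_total": len(tweets),
        "original_tweets": len(originals),
        "top_tweet_id": top["id"] if top else None,
    }
```

A mismatch between `declared_total` and `actual_total` may indicate a partially written or hand-edited file.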

References

- references/api_structure.md: Full documentation of X.com GraphQL API structure
- references/error_codes.md: HTTP error codes and solutions
- references/field_mapping.md: Old vs new field name mapping table
