Install
openclaw skills install @cgxxxxxxxxxxxx/playwright-ocr
Automated web data extraction using Playwright for browser automation and OCR for text recognition. Use when you need to extract data from dynamic web pages, charts, or visual elements that require both browser automation and optical character recognition. v2.0: Added batch processing, data validation, and error recovery.
playwright_ocr/
├── SKILL.md # This file
├── scripts/
│ ├── extract_data.js # Playwright browser automation
│ ├── process_ocr.py # OCR text recognition
│ └── upload_csv.py # Data export and upload
└── output/
  └── extracted_data.csv # Extracted data
npm install playwright
npx playwright install chromium
pip3 install pytesseract pillow
apt-get install tesseract-ocr # Linux
cd /root/.openclaw/workspace/skills/playwright_ocr
node scripts/extract_data.js --url "https://example.com/chart" --output data.json
python3 scripts/process_ocr.py --input screenshots/ --output data.csv
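The OCR step above can be pictured roughly as follows. This is a minimal sketch, not the actual contents of process_ocr.py: the function names `ocr_directory` and `tesseract_recognize` and the two-column CSV layout are assumptions made for illustration. `pytesseract.image_to_string` and `PIL.Image.open` are the real library calls; they are imported lazily so the rest of the code loads without Tesseract installed.

```python
import csv
from pathlib import Path


def tesseract_recognize(image_path):
    """Run Tesseract on one image (requires pytesseract and Pillow)."""
    # Imported lazily so the module loads even without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_directory(input_dir, output_csv, recognize=tesseract_recognize):
    """OCR every PNG in input_dir and write (file, text) rows to output_csv.

    Returns the number of images processed.
    """
    images = sorted(Path(input_dir).glob("*.png"))
    with open(output_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "text"])  # hypothetical column layout
        for image in images:
            writer.writerow([image.name, recognize(image).strip()])
    return len(images)
```

Passing the recognizer as a parameter keeps the directory walk and CSV writing testable with a stub, and would let a different OCR engine be swapped in without touching the loop.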
# Configure target URL and selectors
export TARGET_URL="https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F"
export OUTPUT_DIR="/root/.openclaw/workspace/output"
# Run extraction
python3 scripts/run_pipeline.py
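A pipeline script could pick up the `TARGET_URL` and `OUTPUT_DIR` variables exported above along these lines. This is a sketch under assumptions: `PipelineConfig`, `load_config`, and the `output` fallback directory are hypothetical names and defaults, not taken from run_pipeline.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    target_url: str
    output_dir: str


def load_config(env=None):
    """Read TARGET_URL and OUTPUT_DIR, failing early if the URL is missing."""
    env = os.environ if env is None else env
    target_url = env.get("TARGET_URL", "").strip()
    if not target_url:
        raise ValueError("TARGET_URL is not set")
    # Falling back to ./output is an assumption of this sketch.
    output_dir = env.get("OUTPUT_DIR", "output")
    return PipelineConfig(target_url=target_url, output_dir=output_dir)
```

Failing early on a missing URL matters most for unattended runs, where a silent default would otherwise produce an empty export.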
# Add to crontab for daily extraction
0 2 * * * cd /root/.openclaw/workspace/skills/playwright_ocr && python3 scripts/run_pipeline.py
Browser Automation (Playwright)
OCR Processing (Tesseract)
Data Export
日期,模型名称,Token 消耗,请求次数,费用 (USD)
2026-02-16,Others,67500000000,0,0
2026-02-16,Step 3.5 Flash,55300000000,0,0
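The CSV header is in Chinese: 日期 (date), 模型名称 (model name), Token 消耗 (token consumption), 请求次数 (request count), 费用 (USD) (cost in USD). A consumer of the file might map these to English keys and parse the numeric columns as sketched below. The English key names and the `read_rows` helper are this example's choice, not part of the skill.

```python
import csv
import io

# Chinese CSV headers from the sample above mapped to English keys
# (the English names are this sketch's choice, not part of the skill).
COLUMNS = {
    "日期": "date",
    "模型名称": "model",
    "Token 消耗": "tokens",
    "请求次数": "requests",
    "费用 (USD)": "cost_usd",
}


def read_rows(fh):
    """Yield one dict per CSV row with English keys and numeric fields parsed."""
    for raw in csv.DictReader(fh):
        row = {COLUMNS[key]: value for key, value in raw.items()}
        row["tokens"] = int(row["tokens"])
        row["requests"] = int(row["requests"])
        row["cost_usd"] = float(row["cost_usd"])
        yield row


sample = io.StringIO(
    "日期,模型名称,Token 消耗,请求次数,费用 (USD)\n"
    "2026-02-16,Others,67500000000,0,0\n"
    "2026-02-16,Step 3.5 Flash,55300000000,0,0\n"
)
rows = list(read_rows(sample))
```

When reading the real file instead of the in-memory sample, open it with `newline=""` and `encoding="utf-8"` so the header keys match exactly.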
{
"extraction_date": "2026-03-18",
"source_url": "https://openrouter.ai/apps",
"data": [
{
"date": "2026-02-16",
"model": "Others",
"tokens": 67500000000
}
]
}
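For downstream tools that expect flat records, the nested JSON above can be unrolled into one row per entry. The `flatten` helper below is a hypothetical illustration that assumes every entry carries the three keys shown in the sample.

```python
import json


def flatten(document):
    """Turn the nested JSON export into flat rows, one per data entry."""
    return [
        {
            "date": entry["date"],
            "model": entry["model"],
            "tokens": entry["tokens"],
            "extraction_date": document["extraction_date"],
            "source_url": document["source_url"],
        }
        for entry in document["data"]
    ]


raw = """
{
  "extraction_date": "2026-03-18",
  "source_url": "https://openrouter.ai/apps",
  "data": [
    {"date": "2026-02-16", "model": "Others", "tokens": 67500000000}
  ]
}
"""
rows = flatten(json.loads(raw))
```

Copying `extraction_date` and `source_url` onto every row keeps the provenance attached when rows from several runs are merged.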
extract_data.js
.env file
# Install system dependencies
npx playwright install-deps
# Install additional language packs
sudo apt-get install tesseract-ocr-eng
# Use image pre-processing
python3 scripts/preprocess_image.py --input screenshot.png
# Increase wait time for dynamic content
# Check selectors in extract_data.js
# Enable debug logging: export DEBUG=pw:api
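One way to act on "increase wait time for dynamic content" is to retry a flaky step with a wait that grows after each failure. The helper below is a generic sketch, not code from the skill: `retry` and its parameters are hypothetical, and `action` stands for whatever extraction call is timing out.

```python
import time


def retry(action, attempts=3, initial_wait=1.0, factor=2.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting longer after each failure.

    Re-raises the last exception once all attempts are used up.
    """
    wait = initial_wait
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            sleep(wait)
            wait *= factor
```

With the defaults, a step that fails twice is retried after 1 second and again after 2 seconds. The `sleep` parameter exists only so the waits can be stubbed out in tests.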
New features:
Batch processing
Data validation
Error recovery
PaddleOCR integration
Usage examples:
# Batch-process an entire directory
python3 scripts/batch_ocr_processor.py \
--input /path/to/pdfs \
--output /path/to/results \
--lang chinese_cht \
--parallel 4
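The `--parallel 4` option suggests a bounded worker pool. A minimal sketch of that idea follows, assuming one handler call per file. `process_batch` is a hypothetical helper and does not reflect how batch_ocr_processor.py is actually written.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_batch(input_dir, handler, parallel=4, pattern="*.pdf"):
    """Apply handler to every matching file, at most `parallel` at a time.

    Returns {file name: result}; a file whose handler raises is recorded as
    the exception instead of aborting the whole batch.
    """
    files = sorted(Path(input_dir).glob(pattern))

    def safe(path):
        try:
            return handler(path)
        except Exception as exc:  # keep going; the caller inspects failures
            return exc

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(safe, files))
    return {path.name: result for path, result in zip(files, results)}
```

Threads fit when each handler mostly waits on an external OCR process or on I/O. For CPU-bound work done inside Python, `ProcessPoolExecutor` from the same module would be the usual substitute.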
# Extraction with validation
node scripts/extract_data.js \
--validate \
--confidence-threshold 0.9 \
--review-queue
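The `--confidence-threshold` and `--review-queue` flags imply a triage step: results at or above the threshold are accepted, the rest are set aside for manual review. The sketch below illustrates that split. `OcrField`, `triage`, the 0 to 1 confidence scale, and the sample values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class OcrField:
    text: str
    confidence: float  # 0.0 to 1.0; the scale is an assumption of this sketch


def triage(fields, threshold=0.9):
    """Split fields into (accepted, review_queue) by confidence threshold."""
    accepted, review_queue = [], []
    for field in fields:
        (accepted if field.confidence >= threshold else review_queue).append(field)
    return accepted, review_queue


# Made-up sample values, including one misread that should be reviewed.
fields = [OcrField("67.5B", 0.97), OcrField("5S.3B", 0.62), OcrField("Others", 0.90)]
accepted, review_queue = triage(fields, threshold=0.9)
```

A field exactly at the threshold is accepted here; whether the comparison should be inclusive is a design choice the source does not specify.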
Performance improvements: