102 Playwright Scraper Skill

v1.0.0

Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt below, then paste it into OpenClaw to install smallkeyboy/102-playwright-scraper-skill.

Prompt preview (Install & Setup):
Install the skill "102 Playwright Scraper Skill" (smallkeyboy/102-playwright-scraper-skill) from ClawHub.
Skill page: https://clawhub.ai/smallkeyboy/102-playwright-scraper-skill
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Canonical install target

openclaw skills install smallkeyboy/102-playwright-scraper-skill

ClawHub CLI


npx clawhub@latest install 102-playwright-scraper-skill
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the included scripts and docs. The repo implements Playwright simple and stealth scrapers, and the requested actions (npm install, npx playwright install chromium, running node scripts) are exactly what a Playwright scraper needs.
Instruction Scope
SKILL.md and scripts focus on navigation, DOM extraction, screenshots and optional HTML saving. They include anti-bot evasion (hide navigator.webdriver, UA spoofing, human-like delays) which aligns with the declared purpose. The docs mention future integration for proxies and CAPTCHA solving (2captcha/Anti-Captcha) but these are not implemented in the provided files. No instructions were found that read unrelated system files or transmit data to hidden endpoints.
Install Mechanism
There is no custom install spec in registry metadata; the package is distributed with package.json and JS files and instructs users to run `npm install` and `npx playwright install chromium`. This pulls Playwright from the npm registry and downloads Chromium via Playwright's official installer — expected for this functionality. No arbitrary personal URLs or archive extracts were used.
Credentials
The skill declares no required environment variables or credentials. Scripts support optional env vars (HEADLESS, WAIT_TIME, SCREENSHOT_PATH, SAVE_HTML, USER_AGENT) that are reasonable for configuration. No secrets or unrelated service tokens are requested.
Persistence & Privilege
Registry flags are default (always:false, user-invocable:true, model-invocation enabled). The skill does not request persistent platform privileges or modify other skills. It writes optional local outputs (screenshots, HTML) to file paths supplied or defaulted, which is expected behavior for a scraper.
Assessment
This skill appears to do what it says: run Playwright scrapers, including techniques to evade bot detection. Before installing, consider:

  1. Legal/ethical risk: evading anti-bot protections and scraping some sites may violate terms of service or local law.
  2. Resource impact: Playwright will download Chromium and run real browser instances (disk and RAM usage).
  3. Network & privacy: scraped data and screenshots are saved locally by default; if you later add CAPTCHA-solving or proxy modules, those may require third-party API keys and could introduce external data flows.

Recommended precautions: run first in an isolated environment (container/VM), inspect and verify the scripts yourself (they are short and readable), and avoid supplying any sensitive credentials. If you need CAPTCHA or proxy support, vet those integrations and providers separately.

Like a lobster shell, security has layers — review code before you run it.

Latest: vk97fm0z8jt3zfvagvgg5sd7re584ycne
104 downloads · 0 stars · 1 version · Updated 1 week ago
v1.0.0 · MIT-0

Playwright Scraper Skill

A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.


🎯 Use Case Matrix

| Target Website | Anti-Bot Level | Recommended Method | Script |
|---|---|---|---|
| Regular sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare-protected | High | Playwright Stealth | scripts/playwright-stealth.js |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |

📦 Installation

cd playwright-scraper-skill
npm install
npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

node scripts/playwright-simple.js "https://example.com"

Example output:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
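For orientation, the sketch below shows how a minimal Playwright scraper could produce the JSON shape above. It is an illustration, not the actual `playwright-simple.js`; the `toResult` and `scrape` helper names are hypothetical, and it assumes `npm install playwright` plus `npx playwright install chromium` have been run.

```javascript
// Build the result object in the shape shown above (helper name is hypothetical).
function toResult(url, title, content, startMs) {
  return {
    url,
    title,
    content,
    elapsedSeconds: ((Date.now() - startMs) / 1000).toFixed(2),
  };
}

// Launch headless Chromium, load the page, and extract title + visible text.
async function scrape(url) {
  const { chromium } = require('playwright'); // lazy-loaded; assumes playwright is installed
  const start = Date.now();
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const title = await page.title();
  const content = await page.evaluate(() => document.body.innerText);
  await browser.close();
  return toResult(url, title, content, start);
}

module.exports = { toResult, scrape };
```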

3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"

Features:

  • Hide automation markers (navigator.webdriver = false)
  • Realistic User-Agent (iPhone, Android)
  • Random delays to mimic human behavior
  • Screenshot and HTML saving support

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

# Install deep-scraper skill
npx clawhub install deep-scraper

# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"

📖 Script Descriptions

scripts/playwright-simple.js

  • Use Case: Regular dynamic websites
  • Speed: Fast (3-5 seconds)
  • Anti-Bot: None
  • Output: JSON (title, content, URL)

scripts/playwright-stealth.js

  • Use Case: Sites with Cloudflare or anti-bot protection
  • Speed: Medium (5-20 seconds)
  • Anti-Bot: Medium-High (hides automation, realistic UA)
  • Output: JSON + Screenshot + HTML file
  • Verified: 100% success on Discuss.com.hk

🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.

4. Special Sites Need Specialized Skills

  • YouTube → deep-scraper
  • Reddit → reddit-scraper
  • Twitter → bird skill

🔧 Customization

All scripts support environment variables:

# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL

# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL

# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL

# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
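One plausible way the scripts could read these variables is sketched below. This is a hypothetical config reader mirroring the documented names and defaults; the shipped scripts may parse them differently.

```javascript
// Hypothetical helper: collect the documented env vars with sensible defaults.
function readConfig(env = process.env) {
  return {
    headless: env.HEADLESS !== 'false',              // headful only when HEADLESS=false
    waitTime: parseInt(env.WAIT_TIME || '5000', 10), // milliseconds
    screenshotPath: env.SCREENSHOT_PATH || null,
    saveHtml: env.SAVE_HTML === 'true',
    userAgent: env.USER_AGENT,                       // undefined -> Playwright default UA
  };
}

module.exports = { readConfig };
```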

📊 Performance Comparison

| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|---|---|---|---|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |

🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

  1. Hide navigator.webdriver — Essential
  2. Realistic User-Agent — Use real devices (iPhone, Android)
  3. Mimic Human Behavior — Random delays, scrolling
  4. Avoid Framework Signatures — Crawlee, Selenium are easily detected
  5. Use addInitScript (Playwright) — Inject before page load
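Points 1, 2, 3 and 5 can be combined in a few lines of Playwright. The sketch below is illustrative (the `launchStealth` and `humanDelayMs` names are hypothetical, not the shipped `playwright-stealth.js`) and assumes playwright is installed:

```javascript
// Point 3: random human-like pause length in milliseconds.
function humanDelayMs(min = 2000, max = 5000) {
  return min + Math.random() * (max - min);
}

async function launchStealth(url) {
  const { chromium } = require('playwright'); // lazy-loaded; assumes playwright is installed
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    // Point 2: a realistic mobile User-Agent instead of the headless default.
    userAgent:
      'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 ' +
      '(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1',
  });
  const page = await context.newPage();
  // Points 1 & 5: addInitScript runs before any page script, so detection
  // code never observes navigator.webdriver === true.
  await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
  });
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await page.waitForTimeout(humanDelayMs()); // point 3: pause before interacting
  return { browser, page };
}

module.exports = { humanDelayMs, launchStealth };
```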

❌ Ineffective Anti-Bot Measures

  1. Only changing User-Agent — Not enough
  2. Using high-level frameworks (Crawlee) — More easily detected
  3. Docker isolation — Doesn't help with Cloudflare

🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

  1. Increase wait time (10-15 seconds)
  2. Try headless: false (headful mode sometimes has higher success rate)
  3. Consider using proxy IPs

Issue: Blank Page

Solution:

  1. Increase waitForTimeout
  2. Use waitUntil: 'networkidle' or 'domcontentloaded'
  3. Check if login is required
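The first two fixes can be folded into a single loader, sketched below. The `loadFully` helper and its default selector are hypothetical; it only illustrates the `waitUntil` and selector-wait pattern described above.

```javascript
// Hypothetical loader for the blank-page case: navigate with a stricter wait
// strategy, then give dynamic content extra time via a selector wait.
async function loadFully(page, url, { selector = 'body *', timeoutMs = 15000 } = {}) {
  await page.goto(url, { waitUntil: 'networkidle', timeout: timeoutMs });
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch {
    // Nothing rendered in time; the page may require login (fix 3).
  }
  return page.content();
}

module.exports = { loadFully };
```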

📝 Memory & Experience

2026-02-07 Discuss.com.hk Test Conclusions

  • ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
  • ❌ Crawlee (deep-scraper) failed (403)
  • ❌ Chaser (Rust) failed (Cloudflare)
  • ❌ Puppeteer standard failed (403)

Best Solution: Pure Playwright + anti-bot techniques (framework-independent)


🚧 Future Improvements

  • Add proxy IP rotation
  • Implement cookie management (maintain login state)
  • Add CAPTCHA handling (2captcha / Anti-Captcha)
  • Batch scraping (parallel URLs)
  • Integration with OpenClaw's browser tool
