Playwright Scraper Skill

An OpenClaw Skill for Playwright-based web scraping with anti-bot protection. Tested successfully on complex sites such as Discuss.com.hk.

MIT-0 · Free to use, modify, and redistribute. No attribution required.

⭐ 40 · 18.5k · 199 current installs · 204 all-time installs

by @waisimon

Security Scan

VirusTotal: Suspicious

OpenClaw: Suspicious (medium confidence)

Purpose & Capability

The skill's name, description, SKILL.md and bundled scripts are coherent with a Playwright-based web scraper that implements anti-bot/stealth techniques. However, the registry metadata claims "Required binaries: none" and "instruction-only" while the documentation and scripts clearly require Node.js/npx and Playwright (and download Chromium). The missing declaration of those runtime dependencies is an inconsistency that matters for installation and security posture.

ℹ

Instruction Scope

SKILL.md and the scripts instruct the agent to install dependencies (npm install, npx playwright install chromium) and run local JS scripts that (a) alter navigator properties to hide automation markers, (b) set UA, (c) save screenshots/HTML, and (d) optionally use proxies/CAPTCHA services in future. All of these are within the stated scraping purpose. The instructions do encourage evasive techniques (proxies, CAPTCHA solving) which enable circumvention of anti-bot controls — that is legitimate for scraping but increases misuse risk. The scripts do not exfiltrate data to external endpoints or read arbitrary system files beyond writing screenshots/HTML to disk.

ℹ

Install Mechanism

There is no registry install spec, but package.json/package-lock are present and point to Playwright from the public npm registry (resolved to known packages). Installation uses standard npm and npx playwright install chromium which will download browser binaries. No remote arbitrary download URLs, URL shorteners, or personal servers were used in the manifest. This is a common but non-trivial install step (large browser download, network access).

✓

Credentials

The skill does not request secret environment variables or credentials. The scripts accept non-sensitive env vars (WAIT_TIME, SCREENSHOT_PATH, HEADLESS, USER_AGENT, SAVE_HTML). SKILL.md mentions future CAPTCHA/proxy integrations (which would require service keys) but these are not present in the current code. Current env/credential requests are proportionate to the stated purpose.

✓

Persistence & Privilege

The skill is not always-included and is user-invocable. It does not request elevated system privileges or modify other skills or global agent configuration. Running the scripts will write files (screenshots, HTML) within the working directory or provided paths — expected behavior for a scraper.

What to consider before installing

This package appears to be a legitimate Playwright-based scraper and the code implements stealth techniques to evade anti-bot protections. Before installing or running it:

- Be aware of the metadata mismatch: you will need Node (recommended v18+), npm/npx, and Playwright; the skill will download Chromium (significant disk + network). The registry entry did not declare these required binaries.
- Run in an isolated environment (container/VM) if you want to limit risk from running untrusted code and browser binaries.
- Review the scripts yourself (they are small and included). They do not call external C2 endpoints or exfiltrate secrets, but they do modify navigator properties to hide automation markers and encourage use of proxies/CAPTCHA-solving in future.
- Avoid supplying any sensitive API keys (anti-captcha, proxy credentials) unless you trust the code and the maintainer; those integrations would increase risk if added later.
- Consider legal and terms-of-service risks: the skill actively helps bypass anti-bot measures (proxies, headful mode, navigator masking). Using it against sites that disallow scraping can violate laws or terms.

If you want to proceed, ensure Node/npm are installed, inspect package.json/package-lock, run npm install and npx playwright install chromium in a controlled environment, and test on benign pages first.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.2.0

License

MIT-0. Terms: https://spdx.org/licenses/MIT-0.html

SKILL.md

Playwright Scraper Skill

A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.

🎯 Use Case Matrix

| Target Website | Anti-Bot Level | Recommended Method | Script |
| --- | --- | --- | --- |
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | `scripts/playwright-simple.js` |
| Cloudflare Protected | High | Playwright Stealth ⭐ | `scripts/playwright-stealth.js` |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |
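The matrix above can be read as a simple lookup from anti-bot level to script. The following is a minimal sketch of that idea; `chooseMethod` and `METHODS` are hypothetical names used for illustration and are not part of the skill.

```javascript
// Hypothetical lookup mirroring the Use Case Matrix; not shipped with the skill.
const METHODS = new Map([
  ['low', { method: 'web_fetch tool', script: null }],
  ['medium', { method: 'Playwright Simple', script: 'scripts/playwright-simple.js' }],
  ['high', { method: 'Playwright Stealth', script: 'scripts/playwright-stealth.js' }],
]);

// Returns the recommended method for a given anti-bot level (case-insensitive).
function chooseMethod(antiBotLevel) {
  const entry = METHODS.get(String(antiBotLevel).toLowerCase());
  if (!entry) {
    throw new Error(`Unknown anti-bot level: ${antiBotLevel}`);
  }
  return entry;
}

console.log(chooseMethod('High').script); // prints "scripts/playwright-stealth.js"
```

The "Special" rows (YouTube, Reddit) are deliberately left out of the lookup, since they point to separately installed skills rather than to scripts in this package.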

📦 Installation

cd playwright-scraper-skill
npm install
npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

node scripts/playwright-simple.js "https://example.com"

Example output:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
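Note that `elapsedSeconds` is quoted in the example, so it is a string rather than a number. One way such a value can arise is through `toFixed(2)`; the sketch below assumes that approach, and `buildResult` is a hypothetical helper, not the script's actual code.

```javascript
// Hypothetical helper: builds an object shaped like the example output above.
function buildResult(url, title, content, startedAtMs, finishedAtMs = Date.now()) {
  return {
    url,
    title,
    content,
    // toFixed(2) returns a string, which matches the quoted value in the example.
    elapsedSeconds: ((finishedAtMs - startedAtMs) / 1000).toFixed(2),
  };
}

const exampleResult = buildResult('https://example.com', 'Example Domain', '...', 0, 3450);
console.log(JSON.stringify(exampleResult, null, 2));
```

Consumers that need arithmetic on the elapsed time would convert it back with `Number(result.elapsedSeconds)`.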

3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"

Features:

Hide automation markers (navigator.webdriver = false)
Realistic User-Agent (iPhone, Android)
Random delays to mimic human behavior
Screenshot and HTML saving support
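The "random delays" feature can be illustrated with a small helper. This is a sketch under assumptions: the function names and the 500 to 2000 ms range are hypothetical, and the skill's own script may choose delays differently.

```javascript
// Returns an integer number of milliseconds in [minMs, maxMs].
// `random` is injectable so the helper can be tested deterministically.
function randomDelayMs(minMs, maxMs, random = Math.random) {
  if (minMs > maxMs) {
    throw new RangeError('minMs must be <= maxMs');
  }
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Pauses for a random duration and reports how long it waited.
// The default range is an assumption for illustration only.
async function humanPause(minMs = 500, maxMs = 2000) {
  const ms = randomDelayMs(minMs, maxMs);
  await sleep(ms);
  return ms;
}

console.log(randomDelayMs(500, 2000)); // some integer between 500 and 2000
```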

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

# Install deep-scraper skill
npx clawhub install deep-scraper

# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"

📖 Script Descriptions

`scripts/playwright-simple.js`

Use Case: Regular dynamic websites
Speed: Fast (3-5 seconds)
Anti-Bot: None
Output: JSON (title, content, URL)

`scripts/playwright-stealth.js` ⭐

Use Case: Sites with Cloudflare or anti-bot protection
Speed: Medium (5-20 seconds)
Anti-Bot: Medium-High (hides automation, realistic UA)
Output: JSON + Screenshot + HTML file
Verified: 100% success on Discuss.com.hk in our 2026-02-07 test

🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.

4. Special Sites Need Specialized Skills

YouTube → deep-scraper
Reddit → reddit-scraper
Twitter → bird skill
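The four practices above amount to two decisions: route known special sites to their dedicated skill, and otherwise escalate from the cheapest method to the most robust one. A minimal sketch follows; `specializedSkillFor` and `scrapeWithEscalation` are hypothetical names, the domain list is an assumption based on the sites named above, and each strategy's `run` function stands in for whatever actually invokes web_fetch or a script.

```javascript
// Assumed mapping from site domain to the specialized skill named in this document.
const SPECIAL_SKILLS = new Map([
  ['youtube.com', 'deep-scraper'],
  ['reddit.com', 'reddit-scraper'],
  ['twitter.com', 'bird'],
]);

// Returns the specialized skill for a URL, or null if none applies.
function specializedSkillFor(url) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  for (const [domain, skill] of SPECIAL_SKILLS) {
    if (host === domain || host.endsWith(`.${domain}`)) {
      return skill;
    }
  }
  return null;
}

// Tries each strategy in order and returns the first success.
// `strategies` is an array of { name, run } where run(url) returns a promise.
async function scrapeWithEscalation(url, strategies) {
  const errors = [];
  for (const { name, run } of strategies) {
    try {
      return { method: name, result: await run(url) };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`All methods failed for ${url}: ${errors.join('; ')}`);
}

console.log(specializedSkillFor('https://www.youtube.com/watch?v=VIDEO_ID')); // prints "deep-scraper"
```

In practice the strategy list would be ordered web_fetch, then Playwright Simple, then Playwright Stealth, so that the slower stealth path only runs when the faster ones fail.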

🔧 Customization

All scripts support environment variables:

# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL

# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL

# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL

# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
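Inside a script, these variables would typically be read from `process.env` and normalized once. The sketch below shows one way to do that; `readConfig` is a hypothetical helper, and the default values (a 5000 ms wait, headless mode on, HTML saving off) are assumptions rather than the scripts' documented defaults.

```javascript
// Hypothetical config reader for the environment variables listed above.
function readConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    // Assumed default of 5000 ms when WAIT_TIME is missing or invalid.
    waitTimeMs: Number.isFinite(waitTime) && waitTime >= 0 ? waitTime : 5000,
    screenshotPath: env.SCREENSHOT_PATH || null,
    // Headless unless explicitly disabled with HEADLESS=false.
    headless: env.HEADLESS !== 'false',
    saveHtml: env.SAVE_HTML === 'true',
    userAgent: env.USER_AGENT || null,
  };
}

console.log(readConfig({ WAIT_TIME: '10000', HEADLESS: 'false' }));
```

Passing the environment as a parameter keeps the helper easy to test without mutating `process.env`.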

📊 Performance Comparison

| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
| --- | --- | --- | --- |
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |

🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

Hide navigator.webdriver — Essential
Realistic User-Agent — Use real devices (iPhone, Android)
Mimic Human Behavior — Random delays, scrolling
Avoid Framework Signatures — Crawlee, Selenium are easily detected
Use addInitScript (Playwright) — Inject before page load
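The first and last items combine naturally: Playwright's `addInitScript` on a browser context registers a function that runs in every page before the page's own scripts. The sketch below illustrates that mechanism only; `maskWebdriver` and `registerInitScript` are hypothetical names, and the skill's own stealth script may do more or do it differently.

```javascript
// Runs inside the browser page (not in Node) before any page script executes.
function maskWebdriver() {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
}

// `context` is expected to be a Playwright BrowserContext.
async function registerInitScript(context) {
  await context.addInitScript(maskWebdriver);
}

// Usage with Playwright (requires `npm install playwright`), shown as comments
// so this block stays runnable without the dependency:
//   const { chromium } = require('playwright');
//   const browser = await chromium.launch();
//   const context = await browser.newContext();
//   await registerInitScript(context);
//   const page = await context.newPage();
```

Registering the script on the context rather than on a single page means every page opened from that context receives it.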

❌ Ineffective Anti-Bot Measures

Only changing User-Agent — Not enough
Using high-level frameworks (Crawlee) — More easily detected
Docker isolation — Doesn't help with Cloudflare

🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

Increase wait time to 10-15 seconds (for example WAIT_TIME=15000)
Try headful mode (HEADLESS=false); it sometimes has a higher success rate
Consider using proxy IPs

Issue: Blank Page

Solution:

Increase waitForTimeout
Use waitUntil: 'networkidle' or 'domcontentloaded'
Check if login is required
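The first two suggestions can be sketched as a navigation helper that prefers `networkidle`, falls back to `domcontentloaded`, and then waits a little longer before reading the page. `loadWithFallback` is a hypothetical name, and the 30000 ms timeout and 5000 ms extra wait are assumed values; `page` is expected to be a Playwright Page.

```javascript
// Navigates with the stricter wait condition first, then a weaker one.
// Returns the trimmed body text so callers can detect a blank page.
async function loadWithFallback(page, url, extraWaitMs = 5000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
  } catch (err) {
    // networkidle may never be reached on pages that keep polling,
    // so retry with an event that fires earlier.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
  }
  // Give late JavaScript rendering some additional time.
  await page.waitForTimeout(extraWaitMs);
  const text = await page.innerText('body');
  return text.trim();
}
```

An empty return value after both attempts suggests the content is gated, which is when the third suggestion (checking whether login is required) becomes relevant.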

📝 Memory & Experience

2026-02-07 Discuss.com.hk Test Conclusions

✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
❌ Crawlee (deep-scraper) failed (403)
❌ Chaser (Rust) failed (Cloudflare)
❌ Puppeteer standard failed (403)

Best Solution: Pure Playwright + anti-bot techniques (framework-independent)

🚧 Future Improvements

Add proxy IP rotation
Implement cookie management (maintain login state)
Add CAPTCHA handling (2captcha / Anti-Captcha)
Batch scraping (parallel URLs)
Integration with OpenClaw's browser tool
