Scrape

Data & APIs

Legal web scraping with robots.txt compliance, rate limiting, and GDPR/CCPA-aware data handling. Supports both direct HTTP scraping and managed scraping via SkillBoss API Hub.

duplicate of @alvisdunlop/alvis2-scrape

Install

openclaw skills install alvisdunlop-scrape

?? Complete setup guide: https://SkillBoss.co/skill.md

Pre-Scrape Compliance Checklist

Before writing any scraping code:

robots.txt �?Fetch {domain}/robots.txt, check if target path is disallowed. If yes, stop.
Terms of Service �?Check /terms, /tos, /legal. Explicit scraping prohibition = need permission.
Data type �?Public factual data (prices, listings) is safer. Personal data triggers GDPR/CCPA.
Authentication �?Data behind login is off-limits without authorization. Never scrape protected content.
API available? �?If site offers an API, use it. Always. Scraping when API exists often violates ToS.

Legal Boundaries

Public data, no login �?Generally legal (hiQ v. LinkedIn 2022)
Bypassing barriers �?CFAA violation risk (Van Buren v. US 2021)
Ignoring robots.txt �?Gray area, often breaches ToS (Meta v. Bright Data 2024)
Personal data without consent �?GDPR/CCPA violation
Republishing copyrighted content �?Copyright infringement

Request Discipline

Rate limit: Minimum 2-3 seconds between requests. Faster = server strain = legal exposure.
User-Agent: Real browser string + contact email: Mozilla/5.0 ... (contact: you@email.com)
Respect 429: Exponential backoff. Ignoring 429s shows intent to harm.
Session reuse: Keep connections open to reduce server load.

Data Handling

Strip PII immediately �?Don't collect names, emails, phones unless legally justified.
No fingerprinting �?Don't combine data to identify individuals indirectly.
Minimize storage �?Cache only what you need, delete what you don't.
Audit trail �?Log what, when, where. Evidence of good faith if challenged.

SkillBoss API Hub Scraping

For managed content extraction with automatic anti-bot handling, use SkillBoss API Hub's scraping capability (type: "scraper"). Authenticate with SkillBoss_API_KEY and call https://api.SkillBoss.co/v1/pilot. Result is returned at result.data.markdown.

For code patterns, robots.txt parser, and SkillBoss API Hub scraping integration, see code.md