Scrape

Legal web scraping with robots.txt compliance, rate limiting, and GDPR/CCPA-aware data handling.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
57 current installs · 57 all-time installs
by Iván (@ivangdavila)
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name and description (legal, robots.txt compliance, rate limiting, GDPR/CCPA awareness) align with the provided runtime instructions and example code. There are no unrelated environment variables, binaries, or install steps requested.
Instruction Scope
SKILL.md and code.md stay within scraping scope and emphasize legal/ethical checks (robots.txt, ToS, rate limits, avoiding authenticated content, stripping PII). Two items to be aware of: the robots.txt helper returns True (allows scraping) when a site has no robots.txt, a permissive default that some teams may not want; and the guidance to embed a contact email in the User-Agent can expose a personal address, so use a role or monitored mailbox instead. The PII-stripping guidance is high level; the code does not include concrete PII detection or removal, so implement that carefully before storing or transmitting scraped data.
Install Mechanism
Instruction-only skill with no install spec and no code files executed at install time — lowest installation risk.
Credentials
The skill requests no credentials or config paths. The only operational input shown is a contact email supplied at runtime; exposing a personal email in User-Agent is a privacy consideration but not disproportionate to the skill's purpose.
Persistence & Privilege
always:false and default autonomous invocation are normal. The skill does not request persistent system privileges or claim to modify other skills or system-wide settings.
Assessment
This skill appears coherent and focused on responsible web scraping, but review a few practical points before using:

  1. Prefer an official API when available, and confirm you have authorization before accessing behind-login content.
  2. Use a role or monitored address (not a personal email) as the contact in the User-Agent to avoid leaking personal details.
  3. The example robots.txt helper treats a missing robots.txt as allowed; decide whether your policy should be more conservative.
  4. Implement concrete PII detection/removal and retention/deletion policies before persisting scraped data, and keep audit logs to demonstrate good faith.
  5. If you plan to run this autonomously or at scale, consult your legal/compliance team (GDPR/CCPA and contractual ToS risk) and test against a non-production target first.


Current version: v1.0.0


SKILL.md

Pre-Scrape Compliance Checklist

Before writing any scraping code:

  1. robots.txt — Fetch {domain}/robots.txt, check if target path is disallowed. If yes, stop.
  2. Terms of Service — Check /terms, /tos, /legal. Explicit scraping prohibition = need permission.
  3. Data type — Public factual data (prices, listings) is safer. Personal data triggers GDPR/CCPA.
  4. Authentication — Data behind login is off-limits without authorization. Never scrape protected content.
  5. API available? — If the site offers an API, use it. Scraping when an API exists often violates the ToS.
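The robots.txt check in step 1 can be sketched with the standard library alone. This is an illustrative helper, not the skill's own code.md implementation; note the policy choice here of failing closed when robots.txt is unreachable, with the missing-file (404) case exposed as an explicit parameter rather than defaulting to "allowed".

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def allowed_by_rules(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def can_scrape(url: str, user_agent: str, allow_if_missing: bool = False) -> bool:
    """Fetch the site's robots.txt and check whether `url` may be scraped.

    Fails closed on network errors; a 404 (no robots.txt at all) is a
    policy decision surfaced via `allow_if_missing`.
    """
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        with urlopen(robots_url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError as exc:
        return allow_if_missing if exc.code == 404 else False
    except URLError:
        return False
    return allowed_by_rules(body, user_agent, url)
```

If the check returns False, stop, per step 1 of the checklist.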

Legal Boundaries

  • Public data, no login — Generally legal (hiQ v. LinkedIn 2022)
  • Bypassing barriers — CFAA violation risk (Van Buren v. US 2021)
  • Ignoring robots.txt — Gray area, often breaches ToS (Meta v. Bright Data 2024)
  • Personal data without consent — GDPR/CCPA violation
  • Republishing copyrighted content — Copyright infringement

Request Discipline

  • Rate limit: Minimum 2-3 seconds between requests. Faster = server strain = legal exposure.
  • User-Agent: Real browser string + contact email: Mozilla/5.0 ... (contact: you@email.com). Prefer a role or monitored mailbox over a personal address.
  • Respect 429: Exponential backoff. Ignoring 429s shows intent to harm.
  • Session reuse: Keep connections open to reduce server load.

Data Handling

  • Strip PII immediately — Don't collect names, emails, phones unless legally justified.
  • No fingerprinting — Don't combine data to identify individuals indirectly.
  • Minimize storage — Cache only what you need, delete what you don't.
  • Audit trail — Log what, when, where. Evidence of good faith if challenged.

For code patterns and the robots.txt parser, see code.md.

Files

2 total