Scrape

Legal web scraping with robots.txt compliance, rate limiting, and GDPR/CCPA-aware data handling.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
57 current installs · 57 all-time installs
by Iván (@ivangdavila)
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name and description (legal, robots.txt compliance, rate limiting, GDPR/CCPA awareness) align with the provided runtime instructions and example code. There are no unrelated environment variables, binaries, or install steps requested.
Instruction Scope
SKILL.md and code.md stay within scraping scope and emphasize legal/ethical checks (robots.txt, ToS, rate limits, avoiding authenticated content, stripping PII). Two items to be aware of: the robots.txt helper returns True (allows scraping) when a site has no robots.txt, a permissive default that some teams may not want; and the guidance to embed a contact email in the User-Agent can expose a personal address, so use a role or monitored mailbox instead. The PII-stripping guidance is high level; the code does not include concrete PII detection or removal, so implement that carefully before storing or transmitting scraped data.
Install Mechanism
Instruction-only skill with no install spec and no code files executed at install time — lowest installation risk.
Credentials
The skill requests no credentials or config paths. The only operational input shown is a contact email supplied at runtime; exposing a personal email in User-Agent is a privacy consideration but not disproportionate to the skill's purpose.
Persistence & Privilege
always:false and default autonomous invocation are normal. The skill does not request persistent system privileges or claim to modify other skills or system-wide settings.
Assessment
This skill appears coherent and focused on responsible web scraping, but review a few practical points before using:

  1. Prefer an official API when available, and confirm you have authorization before accessing behind-login content.
  2. Use a role or monitored address (not a personal email) as the contact in the User-Agent to avoid leaking personal details.
  3. The example robots.txt helper treats a missing robots.txt as allowed; decide whether your policy should be more conservative.
  4. Implement concrete PII detection/removal and retention/deletion policies before persisting scraped data, and keep audit logs to demonstrate good faith.
  5. If you plan to run this autonomously or at scale, consult your legal/compliance team (GDPR/CCPA and contractual ToS risk) and test against a non-production target first.


Current version: v1.0.0


SKILL.md

Pre-Scrape Compliance Checklist

Before writing any scraping code:

  1. robots.txt — Fetch {domain}/robots.txt, check if target path is disallowed. If yes, stop.
  2. Terms of Service — Check /terms, /tos, /legal. Explicit scraping prohibition = need permission.
  3. Data type — Public factual data (prices, listings) is safer. Personal data triggers GDPR/CCPA.
  4. Authentication — Data behind login is off-limits without authorization. Never scrape protected content.
  5. API available? — If the site offers an API, use it. Scraping when an API exists often violates the ToS.
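The robots.txt check in step 1 can be sketched with the standard library alone. This is an illustrative helper, not the skill's own code.md implementation; note the policy choice here of failing closed when robots.txt is unreachable, with the missing-file (404) case exposed as an explicit parameter rather than defaulting to "allowed".

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def allowed_by_rules(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def can_scrape(url: str, user_agent: str, allow_if_missing: bool = False) -> bool:
    """Fetch the site's robots.txt and check whether `url` may be scraped.

    Fails closed on network errors; a 404 (no robots.txt at all) is a
    policy decision surfaced via `allow_if_missing`.
    """
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        with urlopen(robots_url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError as exc:
        return allow_if_missing if exc.code == 404 else False
    except URLError:
        return False
    return allowed_by_rules(body, user_agent, url)
```

If the check returns False, stop, per step 1 of the checklist.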

Legal Boundaries

  • Public data, no login — Generally legal (hiQ v. LinkedIn 2022)
  • Bypassing barriers — CFAA violation risk (Van Buren v. US 2021)
  • Ignoring robots.txt — Gray area, often breaches ToS (Meta v. Bright Data 2024)
  • Personal data without consent — GDPR/CCPA violation
  • Republishing copyrighted content — Copyright infringement

Request Discipline

  • Rate limit: Minimum 2-3 seconds between requests. Faster = server strain = legal exposure.
  • User-Agent: Real browser string + contact email: Mozilla/5.0 ... (contact: you@email.com). Prefer a role or monitored mailbox over a personal address.
  • Respect 429: Exponential backoff. Ignoring 429s shows intent to harm.
  • Session reuse: Keep connections open to reduce server load.

Data Handling

  • Strip PII immediately — Don't collect names, emails, phones unless legally justified.
  • No fingerprinting — Don't combine data to identify individuals indirectly.
  • Minimize storage — Cache only what you need, delete what you don't.
  • Audit trail — Log what, when, where. Evidence of good faith if challenged.

For code patterns and the robots.txt parser, see code.md.

Files

2 total