Scrape

v1.0.0

Legal web scraping with robots.txt compliance, rate limiting, and GDPR/CCPA-aware data handling.

8 · 6.4k · 79 current · 80 all-time
by Iván (@ivangdavila)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign (view report →)
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (legal scraping, robots.txt compliance, rate limiting, GDPR/CCPA awareness) align with the provided runtime instructions and example code. No unrelated environment variables, binaries, or install steps are requested.
Instruction Scope
SKILL.md and code.md stay within scraping scope and emphasize legal and ethical checks: robots.txt, terms of service, rate limits, avoiding authenticated content, and stripping PII. Two items to be aware of. First, the robots.txt helper returns True (scraping allowed) when a site has no robots.txt, a permissive default that some teams may not want. Second, the guidance to embed a contact email in the User-Agent can expose a personal address; use a role or monitored mailbox instead. The PII-stripping guidance is also high level: the code includes no concrete PII detection or removal, so implement that carefully before storing or transmitting scraped data.
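A more conservative robots.txt check could look like the sketch below, which treats a missing or unreachable robots.txt as disallowed (the opposite of the skill's permissive default). The bot name and timeout here are illustrative, not part of the skill's code:

```python
import urllib.request
from urllib import robotparser
from urllib.error import URLError
from urllib.parse import urlparse

def can_fetch(url: str, user_agent: str = "example-bot") -> bool:
    """Return True only if robots.txt explicitly permits this fetch.

    Unlike a permissive helper, a missing or unreachable robots.txt
    is treated as *disallowed* here -- a conservative default.
    """
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        # No robots.txt, DNS failure, timeout, 4xx/5xx: refuse to scrape.
        return False
    rp.parse(body.splitlines())
    return rp.can_fetch(user_agent, url)
```

Whether a missing robots.txt means "allowed" is a policy choice, not a technical one; flipping the default to False makes the failure mode "we scraped nothing" rather than "we scraped something we shouldn't have".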
Install Mechanism
Instruction-only skill with no install spec and no code files executed at install time — lowest installation risk.
Credentials
The skill requests no credentials or config paths. The only operational input shown is a contact email supplied at runtime; exposing a personal email in the User-Agent header is a privacy consideration but not disproportionate to the skill's purpose.
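If you do supply a contact address, a role mailbox in the User-Agent plus a simple per-host delay keeps the request pattern polite. This is an illustrative sketch, not the skill's actual code; the address, bot name, and interval are placeholders:

```python
import time
import urllib.request

# Hypothetical role mailbox -- use any monitored team address, not a personal one.
CONTACT = "scraping-ops@example.com"
USER_AGENT = f"example-scraper/1.0 (+mailto:{CONTACT})"

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

def polite_get(url: str, limiter: RateLimiter) -> bytes:
    """Fetch a URL with the contact User-Agent, respecting the rate limit."""
    limiter.wait()
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

Keeping one RateLimiter per host (rather than one global) is a reasonable refinement if you scrape several sites concurrently.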
Persistence & Privilege
always:false and default autonomous invocation are normal settings. The skill requests no persistent system privileges and does not claim to modify other skills or system-wide settings.
Assessment
This skill appears coherent and focused on responsible web scraping, but review a few practical points before using it:
(1) Prefer an official API when available, and confirm you have authorization before accessing behind-login content.
(2) Use a role or monitored address (not a personal email) as the contact in the User-Agent to avoid leaking personal contact details.
(3) The example robots.txt helper treats a missing robots.txt as allowed; decide whether your policy should be more conservative.
(4) Implement concrete PII detection/removal and retention/deletion policies before persisting scraped data, and keep audit logs to demonstrate good faith.
(5) If you plan to run this autonomously or at scale, consult your legal/compliance team (GDPR/CCPA and contractual ToS risk) and test against a non-production target first.
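For point (4), even a minimal pass beats storing raw text, though real PII detection under GDPR/CCPA needs far more than this illustrative sketch (a dedicated library, plus a review of what counts as personal data for your jurisdiction). The patterns below are deliberately simple:

```python
import re

# Illustrative patterns only: these catch obvious emails and phone-like
# digit runs, and will miss names, addresses, IDs, and edge cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Pair this with a retention policy (delete raw pages after extraction) and an audit log of what was fetched, when, and under which robots.txt state.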

Like a lobster shell, security has layers — review code before you run it.

latest: vk97a6pppcchegh105cxc61we9h811xft

