Deep Scraper

v1.0.1

Performs deep scraping of complex sites like YouTube using containerized Crawlee, extracting validated, ad-free transcripts and content as JSON output.

License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name and SKILL.md describe a deep web scraper for dynamic sites (YouTube/X), and the included JS files implement Playwright/Crawlee logic to capture YouTube timedtext and page text. Required resources (Docker, Playwright) align with that purpose; no unrelated credentials or binaries are requested.
Instruction Scope
SKILL.md instructs building a Docker image (tag clawd-crawlee) and insists a Dockerfile remain in the skill directory, but the provided file manifest does not include any Dockerfile. The build/run instructions therefore do not match the shipped files. The runtime instructions also direct network interception (page.on('request')) which can capture more than just transcripts if the page issues sensitive requests — the guidance does not limit or sanitize that capability.
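One way to narrow the interception described above is to filter request URLs before acting on them. The following is a minimal sketch, not the skill's actual code: the function name `isTranscriptRequest`, the host allowlist, and the `/api/timedtext` path prefix are all assumptions for illustration and should be checked against the traffic actually observed.

```javascript
// Hypothetical allowlist of hosts expected to serve caption requests.
const ALLOWED_HOSTS = new Set(['www.youtube.com', 'youtube.com', 'm.youtube.com']);

// Returns true only for HTTPS requests to an allowlisted host whose path
// starts with "/api/timedtext" (assumed caption endpoint).
function isTranscriptRequest(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  return (
    url.protocol === 'https:' &&
    ALLOWED_HOSTS.has(url.hostname) &&
    url.pathname.startsWith('/api/timedtext')
  );
}

// Possible use inside a Playwright handler (page is a Playwright Page):
// page.on('request', (req) => {
//   if (isTranscriptRequest(req.url())) { /* handle the transcript URL */ }
// });
```

With a filter like this, every request that does not match is ignored instead of being re-fetched, which keeps unrelated and potentially sensitive URLs out of the output.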
Install Mechanism
There is no install spec (instruction-only), which is lower-risk, but package.json declares heavy dependencies (crawlee, playwright) and the SKILL.md expects a containerized image. Because no Dockerfile is present, it's unclear how the container is built; that gap needs resolving before running any install/build steps.
Credentials
The skill requests no environment variables or credentials, which is proportionate. However, its network-interception logic (listening to all requests and then fetch()-ing intercepted URLs) could capture request URLs or payloads that include tokens or other sensitive data from target pages. No explicit exfiltration endpoints are present, and output is printed to stdout, but running this against authenticated or private pages risks exposing secrets.
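Because output goes to stdout, one mitigation is to redact sensitive-looking parts of a URL before printing it. This is a hedged sketch only: `redactUrl` and the `SENSITIVE_PARAMS` list are hypothetical names and values, not part of the skill, and a real list would need tuning for the target site.

```javascript
// Hypothetical list of query parameter names treated as sensitive.
const SENSITIVE_PARAMS = ['token', 'access_token', 'signature', 'sig', 'key', 'auth'];

// Returns the URL as a string with sensitive query values replaced and
// any embedded username/password removed.
function redactUrl(rawUrl) {
  const url = new URL(rawUrl);
  // Copy the keys first so the collection is not mutated while iterating.
  for (const name of [...url.searchParams.keys()]) {
    if (SENSITIVE_PARAMS.includes(name.toLowerCase())) {
      url.searchParams.set(name, 'REDACTED');
    }
  }
  url.username = '';
  url.password = '';
  return url.toString();
}

// Example: console.log(redactUrl(interceptedUrl)) instead of logging the raw URL.
```

Redaction by parameter name is a heuristic; it does not cover secrets carried in headers, request bodies, or path segments.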
Persistence & Privilege
The skill does not request permanent presence (always: false), does not modify other skills or system settings, and is user-invocable. It does require Docker to run containers, which is a normal privilege for this type of tool.
What to consider before installing
Key things to consider before installing or running this skill:
- Missing Dockerfile: SKILL.md requires building a Docker image from the skill directory but no Dockerfile is included. Do not build/run unverified images; ask the publisher for the Dockerfile or a reproducible build spec and review it before building.
- Inspect the image and container: if you build a container, review its Dockerfile and resulting image layers and run it in an isolated environment (non-production host, sandbox, or VM) to limit blast radius.
- Network-interception risks: the code listens to all network requests and will fetch intercepted URLs. That can unintentionally capture request URLs or payloads containing authorization tokens or other sensitive data. Only run this against public content and avoid logged-in sessions; consider network restrictions while testing.
- Legal and policy risk: automated scraping that 'penetrates protections' or bypasses rate limits may violate site terms of service or local law. Confirm you have the right to scrape the target site and transcripts.
- Dependency and resource needs: Playwright requires browser binaries; the container will be heavy. Ensure your environment can safely run headless browsers (no privileged mounts, limited access).
- Suggested actions: request the missing Dockerfile, review it and package.json, run the skill in a network-isolated sandbox, and limit input to public URLs only. If you need a lower-risk alternative, prefer a tool that uses official public APIs (with explicit API keys) rather than network interception.
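The advice to limit input to public URLs can be partly enforced with a pre-flight check on the target URL. The sketch below is an assumption-laden illustration, not part of the skill: `isPublicHttpsUrl` is a hypothetical helper that rejects non-HTTPS URLs, embedded credentials, localhost, IPv6 literals, and common private IPv4 ranges. It cannot tell whether a page requires login.

```javascript
// Hypothetical pre-flight check for a scrape target.
function isPublicHttpsUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== 'https:') return false;
  if (url.username || url.password) return false; // no embedded credentials

  const host = url.hostname;
  if (host === 'localhost' || host.endsWith('.local')) return false;
  if (host.startsWith('[')) return false; // reject IPv6 literals outright

  // Reject loopback, link-local, and RFC 1918 private IPv4 addresses.
  const m = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    if (a === 10 || a === 127) return false;
    if (a === 172 && b >= 16 && b <= 31) return false;
    if (a === 192 && b === 168) return false;
    if (a === 169 && b === 254) return false;
  }
  return true;
}
```

A check like this complements, but does not replace, running the container in a network-isolated sandbox, since hostnames can still resolve to internal addresses.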

Like a lobster shell, security has layers — review code before you run it.

latest: vk97bwe7p2sxt1a5xmen45rwgwn80haf5

