Deep Scraper
Performs deep scraping of complex sites like YouTube using containerized Crawlee, extracting validated, ad-free transcripts and content as JSON output.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
⭐ 7 · 7.6k · 45 current installs · 50 all-time installs
by @opsun
Security Scan
OpenClaw
Suspicious (medium confidence)
Purpose & Capability
The name and SKILL.md describe a deep web scraper for dynamic sites (YouTube/X) and the included JS files implement Playwright/Crawlee logic to capture youtube timedtext and page text. Required resources (Docker, Playwright) align with that purpose; there are no unrelated credentials or binaries requested.
Instruction Scope
SKILL.md instructs building a Docker image (tag clawd-crawlee) and insists a Dockerfile remain in the skill directory, but the provided file manifest does not include any Dockerfile, so the build/run instructions do not match the shipped files. The runtime instructions also direct network interception (page.on('request')), which can capture more than just transcripts if the page issues sensitive requests; the guidance neither limits nor sanitizes that capability.
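The unbounded interception flagged above could be narrowed to the single endpoint the skill actually needs. A minimal sketch in Playwright-style JavaScript; the predicate and the commented wiring are illustrative, not the skill's actual code:

```javascript
// Restrict interception to YouTube timedtext (caption) requests only,
// instead of capturing every request the page issues.
function isTimedtextRequest(url) {
  try {
    const u = new URL(url);
    return u.hostname.endsWith("youtube.com") && u.pathname === "/api/timedtext";
  } catch {
    return false; // not a valid absolute URL
  }
}

// Inside the crawler's page handler (illustrative wiring):
// page.on('request', (req) => {
//   if (isTimedtextRequest(req.url())) capturedUrls.push(req.url());
// });

console.log(isTimedtextRequest("https://www.youtube.com/api/timedtext?v=abc&lang=en")); // true
console.log(isTimedtextRequest("https://accounts.google.com/oauth2/token"));            // false
```

A filter like this would keep authorization or account endpoints out of the captured set even when run against a logged-in session.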
Install Mechanism
There is no install spec (instruction-only), which is lower-risk, but package.json declares heavy dependencies (crawlee, playwright) and the SKILL.md expects a containerized image. Because no Dockerfile is present, it's unclear how the container is built; that gap needs resolving before running any install/build steps.
Credentials
The skill requests no environment variables or credentials (proportionate). However, its network-interception logic (listening to all requests and then fetch()-ing intercepted URLs) could capture request URLs or payloads that include tokens or other sensitive data from target pages. No explicit exfiltration endpoints are present (output is printed to stdout), but running this against authenticated or private pages risks exposing secrets.
Persistence & Privilege
The skill does not request permanent presence (always: false), does not modify other skills or system settings, and is user-invocable. It does require Docker to run containers, which is a normal privilege for this type of tool.
What to consider before installing
Key things to consider before installing or running this skill:
- Missing Dockerfile: SKILL.md requires building a Docker image from the skill directory but no Dockerfile is included. Do not build/run unverified images — ask the publisher for the Dockerfile or a reproducible build spec and review it before building.
- Inspect the image and container: if you build a container, review its Dockerfile and resulting image layers and run it in an isolated environment (non-production host, sandbox, or VM) to limit blast radius.
- Network-interception risks: the code listens to all network requests and will fetch intercepted URLs. That can unintentionally capture request URLs or payloads containing authorization tokens or other sensitive data. Only run this against public content and avoid logged-in sessions; consider network restrictions while testing.
- Legal and policy risk: automated scraping that 'penetrates protections' or bypasses rate limits may violate site terms of service or local law. Confirm you have the right to scrape the target site and transcripts.
- Dependency and resource needs: Playwright requires browser binaries; the container will be heavy. Ensure your environment can safely run headless browsers (no privileged mounts, limited access).
- Suggested actions: request the missing Dockerfile, review it and package.json, run the skill in a network-isolated sandbox, and limit input to public URLs only. If you need a lower-risk alternative, prefer a tool that uses official public APIs (with explicit API keys) rather than network interception.

Like a lobster shell, security has layers: review code before you run it.
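For the sandboxing suggestions above, a dry-run helper makes the hardened invocation reviewable before anything executes. All flags are standard docker run options; the resource limits and target URL are illustrative assumptions, not values from the skill:

```shell
# Print (not execute) a hardened docker run command for review.
# --cap-drop/--security-opt reduce container privileges; --shm-size is
# needed because headless Chromium uses shared memory.
docker_cmd() {
  echo docker run --rm \
    --cap-drop ALL --security-opt no-new-privileges \
    --memory 2g --cpus 2 --shm-size 1g \
    -v "$(pwd)/skills/deep-scraper/assets:/usr/src/app/assets" \
    clawd-crawlee node assets/main_handler.js "$1"
}

docker_cmd "https://www.youtube.com/watch?v=EXAMPLE"
```

Inspect the echoed command, then run it by hand once you are satisfied; wrapping it this way keeps an unreviewed image from ever starting by accident.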
Current version: v1.0.1
SKILL.md
Skill: deep-scraper
Overview
A high-performance engineering tool for deep web scraping. It uses a containerized Docker + Crawlee (Playwright) environment to penetrate protections on complex websites like YouTube and X/Twitter, providing "interception-level" raw data.
Requirements
- Docker: Must be installed and running on the host machine.
- Image: Build the environment with the tag clawd-crawlee.
- Build command: docker build -t clawd-crawlee skills/deep-scraper/
Integration Guide
Simply copy the skills/deep-scraper directory into your skills/ folder. Ensure the Dockerfile remains within the skill directory for self-contained deployment.
Standard Interface (CLI)
docker run -t --rm -v $(pwd)/skills/deep-scraper/assets:/usr/src/app/assets clawd-crawlee node assets/main_handler.js [TARGET_URL]
Output Specification (JSON)
The scraping results are printed to stdout as a JSON string:
- status: SUCCESS | PARTIAL | ERROR
- type: TRANSCRIPT | DESCRIPTION | GENERIC
- videoId: (For YouTube) The validated Video ID.
- data: The core text content or transcript.
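An illustrative payload matching this spec; every value below is invented, not real scraper output:

```javascript
// Example of the JSON shape printed to stdout (all values are made up).
const sample = {
  status: "SUCCESS",      // SUCCESS | PARTIAL | ERROR
  type: "TRANSCRIPT",     // TRANSCRIPT | DESCRIPTION | GENERIC
  videoId: "dQw4w9WgXcQ", // present for YouTube targets
  data: "0:01 first caption line\n0:04 second caption line"
};

console.log(JSON.stringify(sample, null, 2));
```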
Core Rules
- ID Validation: All YouTube tasks MUST verify the Video ID to prevent cache contamination.
- Privacy: Strictly forbidden from scraping password-protected or non-public personal information.
- Alpha-Focused: Automatically strips ads and noise, delivering pure data optimized for LLM processing.
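The ID Validation rule above can be sketched as a simple check: YouTube video IDs are commonly 11 characters drawn from [A-Za-z0-9_-]. This helper is illustrative, not the skill's actual implementation:

```javascript
// Sketch of the "ID Validation" rule: reject malformed IDs and detect
// cache contamination by comparing the requested ID to the scraped one.
const VIDEO_ID_RE = /^[A-Za-z0-9_-]{11}$/;

function assertVideoId(expectedId, scrapedId) {
  if (!VIDEO_ID_RE.test(expectedId)) {
    throw new Error(`malformed video ID: ${expectedId}`);
  }
  if (expectedId !== scrapedId) {
    throw new Error("cache contamination: scraped ID does not match request");
  }
  return true;
}

console.log(assertVideoId("dQw4w9WgXcQ", "dQw4w9WgXcQ")); // true
```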
Files
4 total
