test_skill

A web crawler that uses BFS traversal and anti-scraping fallbacks to extract structured BBC and general news content and save it as Markdown, with multi-site and deduplication support.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name and description (BBC-focused universal crawler with anti-scraping fallbacks) match the included code and scripts: a multi-method crawler (crawl4ai, playwright, requests) with deduplication, image download, and Markdown output. Minor inconsistencies (the README mentions Python 3.8+, SKILL.md says 3.9+) do not change the purpose.
Instruction Scope
SKILL.md instructs only to install Python deps and run the crawler with CLI flags. It does not instruct reading unrelated local files or environment secrets, nor does it send collected data to unexpected endpoints (the code crawls target sites and writes local files). The crawler will perform network requests to target websites as expected.
Install Mechanism
No platform install spec is declared in the registry, but the repository includes install.py / install_dependencies.sh, which run pip install -r requirements.txt and python -m playwright install chromium. Dependencies are fetched via pip and Playwright's browser installer (standard mechanisms). Note: crawl4ai is a third-party package with no pinned source, and Playwright downloads browser binaries from the web, so verify the packages and run the installs in an isolated environment.
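The isolated install recommended above can be sketched as follows. The venv name crawler-env is arbitrary, and the actual install commands are left commented so the sketch is inert until you have reviewed requirements.txt:

```shell
# Create an isolated environment so the crawler's dependencies
# (including crawl4ai and Playwright's browser download) cannot
# touch system packages.
python3 -m venv crawler-env

# Install inside the venv, not globally and not as root:
#   crawler-env/bin/pip install -r requirements.txt
#   crawler-env/bin/python -m playwright install chromium
# (commented out here; run them only after reviewing requirements.txt)

# The venv carries its own pip, separate from the system interpreter
ls crawler-env/bin/pip
```

Deleting the crawler-env directory afterwards removes everything the install touched.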
Credentials
The skill declares no required environment variables, credentials, or config paths. Code does not read secrets or request unrelated credentials. Dependencies may later require credentials (e.g., if some optional third-party services are used), so check upstream package docs.
Persistence & Privilege
Skill is not always-enabled and does not request elevated platform privileges. It writes lock files and output data under its working directory only. No modifications to other skills or global agent settings are present.
Scan Findings in Context
[NO_FINDINGS] expected: Static regex scan found no flagged patterns. That aligns with the package being a normal crawler; absence of findings does not guarantee safety—third-party packages and dynamic behavior (Playwright downloads) should still be reviewed.
Assessment
This package appears to be a coherent web crawler. Before installing or running it:

  1. Run pip installs and Playwright browser installs in a virtualenv or sandbox (not as root) to avoid system package conflicts.
  2. Review the requirements (especially crawl4ai) and verify their provenance and any credentials they might require.
  3. Mind legal/ethical rules: respect robots.txt and site terms, and avoid aggressive crawling by using delays and domain restrictions.
  4. If you need higher assurance, inspect the full universal_crawler_v2.py (the provided file was truncated) and run the code in an isolated network environment to observe outbound connections made by dependencies.
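The robots.txt recommendation above can be enforced with Python's standard library before any fetch. This is an illustrative sketch with invented rules, not code from the skill; in real use you would load the target site's actual robots.txt with set_url() and read():

```python
from urllib.robotparser import RobotFileParser

# Parse an inline example so the sketch runs offline; the rules below
# are invented for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check each URL before fetching it
print(rp.can_fetch("*", "https://example.com/news/article"))   # → True (allowed)
print(rp.can_fetch("*", "https://example.com/private/page"))   # → False (disallowed)
```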

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.9


SKILL.md

BBC Crawler MaxClaw

Description

A powerful, universal web crawler optimized for BBC News but capable of crawling other sites. It integrates advanced scraping technologies including Crawl4AI and Playwright to handle dynamic content and anti-bot protections.

Features

  • Multi-Method Extraction:
    • crawl4ai: Primary method using AsyncWebCrawler for high performance and accuracy.
    • playwright: Full browser rendering fallback for complex dynamic pages.
    • requests: Fast fallback for static content.
    • auto: Automatically selects the best method (prioritizes Crawl4AI).
  • Hierarchical Storage: Saves content in a structured format: YYYY-MM-DD/Category/Title.md.
  • Local Image Archiving: Downloads images locally, names them by MD5 hash, and updates Markdown references.
  • Content Filtering: Intelligently extracts main article content and relevant images using CSS selectors.
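A minimal sketch of the storage layout described above. The helpers article_path and image_name are hypothetical, not functions from the skill, and the README does not say what the MD5 hash is computed over, so hashing the image URL is an assumption here:

```python
import hashlib
from datetime import date
from pathlib import Path

def article_path(root: str, category: str, title: str) -> Path:
    # Sanitise the title so it is safe as a filename, then build
    # the YYYY-MM-DD/Category/Title.md hierarchy
    safe = "".join(c for c in title if c.isalnum() or c in " -_").strip()
    return Path(root) / date.today().isoformat() / category / f"{safe}.md"

def image_name(image_url: str) -> str:
    # Assumption: the MD5 is taken over the image URL; the README
    # only says images are "named by MD5 hash"
    ext = Path(image_url).suffix or ".jpg"
    return hashlib.md5(image_url.encode("utf-8")).hexdigest() + ext

print(article_path("output", "UK", "Example: Headline"))
print(image_name("https://example.com/pic.png"))
```

Hash-based image names also give deduplication for free: the same image referenced from two articles resolves to the same file.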

Requirements

  • Python 3.9+
  • See requirements.txt for Python packages.

Installation

# 1. Install dependencies
# Note: install.py supports passing arguments to pip, e.g., --break-system-packages
python install.py

# Example for environments requiring --break-system-packages:
python install.py --break-system-packages

Usage

Basic Usage

python universal_crawler_v2.py --url https://www.bbc.co.uk/news --max-pages 50

Advanced Usage

# Force Crawl4AI
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method crawl4ai

# Force Playwright
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method playwright

# Control depth and delay
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --depth 3 --delay 2.5

# Specify output directory
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --output ./my_data
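The flags above map onto a breadth-first traversal with a visited set, which is the BFS-with-dedup strategy the skill description mentions. The sketch below is illustrative only: fetch_links and the toy link graph stand in for the crawler's real extraction step, and the parameter names simply mirror the CLI options:

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_pages=50, depth=3, delay=0.0):
    allowed = urlparse(start_url).netloc       # same-domain restriction
    visited, order = set(), []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, d = queue.popleft()
        if url in visited or d > depth:
            continue                           # dedup / depth limit
        visited.add(url)
        order.append(url)
        time.sleep(delay)                      # politeness delay (--delay)
        for link in fetch_links(url):
            if urlparse(link).netloc == allowed and link not in visited:
                queue.append((link, d + 1))
    return order

# Toy link graph standing in for live pages
graph = {
    "https://site/news": ["https://site/a", "https://site/b"],
    "https://site/a": ["https://site/b"],
    "https://site/b": [],
}
print(crawl("https://site/news", lambda u: graph.get(u, [])))
# → ['https://site/news', 'https://site/a', 'https://site/b']
```

Note that "https://site/b" is queued twice (from both the start page and page a) but saved only once, which is the dedup behaviour in action.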

Troubleshooting

  • Import Errors: If you see "No module named 'crawl4ai'" or similar, run python install.py again.
  • Empty Responses: Ensure you have the latest version of the crawler. Some sites may block specific IPs or user agents; try increasing delay or switching methods.

Files

11 total
