Web Scraper

Extract structured data from websites using browser automation. Use when scraping product listings, articles, contact info, prices, or any web content. Suppo...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 186 · 1 current install · 1 all-time install
by Yinanping@yinanping-CPU
Security Scan

VirusTotal: Benign
OpenClaw: Suspicious (high confidence)
Purpose & Capability
The skill claims to use browser automation (agent-browser) but the registry metadata lists no required binaries; the Python scripts call a local binary named 'agent-browser' via subprocess.run, which is not declared. SKILL.md also documents additional scripts (scrape_infinite_scroll.py, scrape_dynamic.py) that are referenced but not included in the file manifest. These mismatches mean the bundle is incomplete or undeclared dependencies exist.
Instruction Scope
The SKILL.md and included scripts remain focused on scraping tasks, but SKILL.md instructs using additional scripts and features (infinite scroll, dynamic interaction) that are not provided. The code executes a local binary ('agent-browser') to perform page actions; that binary will have significant control but is outside the skill bundle. The instructions also advise techniques (proxies, CAPTCHA services) that would require external services or credentials not declared here.
Install Mechanism
There is no install spec (instruction-only), which minimizes automatic installation risk. However, the presence of runnable scripts means the agent (or user) will execute local Python files that call an external binary. Because no install step fetches code, nothing is auto-downloaded by the skill, but the skill depends on external tooling that is not described.
Credentials
The skill declares no required environment variables or credentials, which is consistent with the provided scripts that only save data locally. However, SKILL.md recommends using proxies/CAPTCHA services and rotating IPs — policies that normally require credentials or configuration but none are declared. The missing declaration of the 'agent-browser' binary is the primary proportionality issue.
Persistence & Privilege
The skill does not request 'always: true' and does not declare persistence or modifications to other skills. It can be invoked by the agent autonomously (default), which is expected behavior for skills; no extra privileges are requested.
What to consider before installing
This package looks like a normal web scraper, but it has important gaps you should resolve before installing or running it:

  1. The Python scripts call a local binary, 'agent-browser', that the skill metadata does not declare. Ask the author which executable is required and make sure you trust it.
  2. SKILL.md references additional scripts (infinite_scroll, dynamic) that are not included; confirm whether those features exist or were intentionally omitted.
  3. Because the scripts invoke an external browser-automation binary via subprocess, that binary carries network and execution privileges; verify its provenance and inspect it for unwanted behavior.
  4. If you plan to follow SKILL.md guidance to use proxies or CAPTCHA-solving services, expect to supply credentials or configuration not declared here; only provide such secrets to trusted code and services.

If the skill author does not clarify these points, treat the skill as incomplete/untrusted and avoid running it on sensitive systems or with privileged credentials.


Current version: v1.0.0
Latest: vk9714zyp1yq88nbwpe0hxv114x82eqnv


SKILL.md

Web Scraper

Overview

Professional web scraping skill using agent-browser. Extract structured data from any website with support for JavaScript-rendered content, pagination, and complex selectors.

Use Cases

  • E-commerce: Product listings, prices, reviews, inventory
  • Real Estate: Property listings, prices, agent contacts
  • Job Boards: Job postings, salaries, requirements
  • News/Media: Articles, headlines, publication dates
  • Directories: Business listings, contact information
  • Competitor Monitoring: Prices, products, content changes

Quick Start

Scrape Single Page

python scripts/scrape_page.py \
  --url "https://example.com/products" \
  --fields "title=h2.title,price=.price,link=a.href" \
  --output products.csv

Scrape with Pagination

python scripts/scrape_paginated.py \
  --url "https://example.com/products?page={page}" \
  --pages 10 \
  --fields "title,price,description" \
  --output all_products.csv

Scripts

scrape_page.py

Scrape a single page or static list.

Arguments:

  • --url - Target URL
  • --fields - Field definitions (name=selector format, comma-separated)
  • --output - Output file (CSV, JSON, or XLSX)
  • --format - Output format (csv, json, xlsx)
  • --wait - Wait time for dynamic content (seconds)

Field Definition Format:

fieldname=css_selector

Examples:

title=h1.product-title
price=.price-tag
description=.product-description
image=img.product-image.src
link=a.product-link.href
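The bundled scripts are not shown on this page, so as an illustration only, here is one way a "name=selector" spec like the examples above could be parsed. The convention that a trailing .src or .href names an attribute to extract (rather than a CSS class) is inferred from the examples, not confirmed by the skill's code:

```python
def parse_fields(spec):
    """Parse a comma-separated "name=selector" spec into a dict.

    A trailing ".src" or ".href" on the selector is treated as an
    attribute to extract; anything else extracts the element's text.
    """
    fields = {}
    for pair in spec.split(","):
        name, _, selector = pair.partition("=")
        attr = "text"
        for candidate in ("src", "href"):
            if selector.endswith("." + candidate):
                selector = selector[: -(len(candidate) + 1)]
                attr = candidate
        fields[name.strip()] = {"selector": selector.strip(), "attr": attr}
    return fields
```

For example, "link=a.product-link.href" yields selector a.product-link with attribute href, while "title=h1.product-title" yields a plain text extraction.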

scrape_paginated.py

Scrape multiple pages with pagination.

Arguments:

  • --url - URL pattern (use {page} for page number)
  • --pages - Number of pages to scrape
  • --fields - Field definitions
  • --output - Output file
  • --delay - Delay between pages (seconds)
  • --next-selector - CSS selector for "next page" button (alternative to URL pattern)
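The pagination loop can be sketched as below. This is a minimal sketch, not the script's actual code: fetch_page is a hypothetical callable supplied by the caller (the real scripts delegate fetching to the undeclared agent-browser binary), and the early stop on an empty page is an assumption:

```python
import time

def scrape_paginated(url_pattern, pages, fetch_page, delay=2.0):
    """Iterate numbered pages, collecting rows from each.

    url_pattern must contain "{page}"; fetch_page(url) -> list of dicts
    is supplied by the caller.
    """
    rows = []
    for page in range(1, pages + 1):
        url = url_pattern.format(page=page)
        batch = fetch_page(url)
        if not batch:          # stop early when a page yields nothing
            break
        rows.extend(batch)
        if page < pages:
            time.sleep(delay)  # polite delay between requests
    return rows
```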

scrape_infinite_scroll.py

Scrape pages with infinite scroll loading.

Arguments:

  • --url - Target URL
  • --scrolls - Number of scroll actions
  • --fields - Field definitions
  • --output - Output file
  • --scroll-delay - Delay between scrolls (ms)

scrape_dynamic.py

Scrape JavaScript-heavy sites with custom interactions.

Arguments:

  • --url - Target URL
  • --actions - JSON file with interaction sequence
  • --fields - Field definitions
  • --output - Output file

Configuration

Actions JSON Format (for dynamic scraping)

{
  "actions": [
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "ms": 2000},
    {"type": "scroll", "direction": "down", "pixels": 500},
    {"type": "fill", "selector": "#search", "value": "keyword"},
    {"type": "press", "key": "Enter"}
  ]
}
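An actions file can be sanity-checked before a run. The validator below is a sketch whose schema is inferred purely from the five action types shown in the example above; the real scripts may accept other types or keys:

```python
import json

# Required keys per action type, inferred from the example above
ACTION_SCHEMA = {
    "click": {"selector"},
    "wait": {"ms"},
    "scroll": {"direction", "pixels"},
    "fill": {"selector", "value"},
    "press": {"key"},
}

def validate_actions(path):
    """Load an actions JSON file and raise ValueError on malformed steps."""
    with open(path) as f:
        doc = json.load(f)
    for i, step in enumerate(doc.get("actions", [])):
        kind = step.get("type")
        if kind not in ACTION_SCHEMA:
            raise ValueError(f"step {i}: unknown action type {kind!r}")
        missing = ACTION_SCHEMA[kind] - step.keys()
        if missing:
            raise ValueError(f"step {i}: missing keys {sorted(missing)}")
    return doc["actions"]
```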

Output Formats

CSV:

title,price,link,url
"Product A",29.99,https://...,https://...
"Product B",39.99,https://...,https://...

JSON:

[
  {
    "title": "Product A",
    "price": "29.99",
    "link": "https://...",
    "scraped_at": "2026-03-07T16:00:00"
  }
]

Excel (XLSX):

  • Same as CSV but with formatting options
  • Multiple sheets support
  • Auto-fit columns
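A writer for the CSV and JSON formats above could look like the following sketch (not the skill's actual code). The scraped_at stamp mirrors the JSON example; XLSX output would need a third-party library such as openpyxl and is omitted here:

```python
import csv
import json
from datetime import datetime, timezone

def save_rows(rows, path, fmt="csv"):
    """Write scraped rows to CSV or JSON, stamping each with scraped_at."""
    if not rows:
        return
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    rows = [{**row, "scraped_at": stamp} for row in rows]
    if fmt == "json":
        with open(path, "w") as f:
            json.dump(rows, f, indent=2)
    elif fmt == "csv":
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```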

Examples

Example 1: Scrape E-commerce Products

python scripts/scrape_paginated.py \
  --url "https://example.com/shop?page={page}" \
  --pages 5 \
  --fields "name=.product-name,price=.price,rating=.stars,reviews=.review-count,url=a.href" \
  --output products.csv \
  --delay 3

Example 2: Scrape News Articles

python scripts/scrape_page.py \
  --url "https://news-site.com/latest" \
  --fields "headline=h2.article-title,summary=.article-summary,author=.byline,date=.publish-date,url=a.read-more.href" \
  --output articles.json \
  --format json

Example 3: Scrape Job Postings

python scripts/scrape_infinite_scroll.py \
  --url "https://jobs-site.com/search" \
  --scrolls 10 \
  --fields "title=.job-title,company=.company-name,location=.location,salary=.salary,posted=.date-posted,url=a.job-link.href" \
  --output jobs.csv \
  --scroll-delay 1500

Example 4: Scrape Real Estate Listings

python scripts/scrape_paginated.py \
  --url "https://realestate.com/listings?page={page}" \
  --pages 20 \
  --fields "address=.property-address,price=.listing-price,beds=.bedrooms,baths=.bathrooms,sqft=.square-feet,url=a.property-link.href" \
  --output listings.xlsx \
  --format xlsx \
  --delay 5

Best Practices

  1. Respect robots.txt - Check and follow site rules
  2. Rate limiting - Add delays between requests (2-5s recommended)
  3. Error handling - Handle missing elements gracefully
  4. User-Agent - Use realistic browser headers
  5. Retry logic - Implement retries for failed requests
  6. Data validation - Validate extracted data before saving
  7. Storage - Save intermediate results for long scrapes
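Retry logic (practice 5) combined with backoff can be sketched as a decorator. This is illustrative only; the exception type worth retrying depends on the underlying fetch code, so TimeoutError here is a placeholder:

```python
import random
import time
from functools import wraps

def with_retries(attempts=3, base_delay=2.0, retry_on=(TimeoutError,)):
    """Retry a flaky function with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the error
                    # exponential backoff with jitter to avoid lockstep retries
                    time.sleep(base_delay * 2 ** attempt
                               + random.uniform(0, base_delay))
        return wrapper
    return decorator
```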

Anti-Scraping Measures

Some sites employ anti-scraping techniques:

Measure                 Countermeasure
IP blocking             Use proxies, rotate IPs
CAPTCHA                 Manual solving or CAPTCHA services
Rate limiting           Increase delays, randomize timing
JavaScript challenges   Use browser automation (agent-browser)
Honeypot traps          Avoid hidden fields, validate selectors
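Two of the countermeasures above, randomized timing and realistic headers, can be sketched together. The user-agent strings below are illustrative placeholders, not values taken from this skill:

```python
import random

# Illustrative placeholder user-agent strings; real runs should use
# current browser versions.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_request_plan(count, min_delay=2.0, max_delay=5.0):
    """Yield (delay_seconds, user_agent) pairs with randomized spacing."""
    for _ in range(count):
        yield random.uniform(min_delay, max_delay), random.choice(USER_AGENTS)
```

The caller sleeps for each yielded delay before issuing the next request with the paired header, so request timing never falls into a detectable fixed rhythm.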

Legal Considerations

  • Public data: Generally legal to scrape
  • Terms of Service: Review site ToS before scraping
  • Copyright: Don't republish copyrighted content
  • Personal data: GDPR/privacy laws may apply
  • Commercial use: May require permission

Disclaimer: This skill is for educational purposes. Users are responsible for compliance with applicable laws and website terms.

Troubleshooting

  • Elements not found: Verify CSS selectors with browser dev tools
  • Empty results: Check if content is JavaScript-rendered (use dynamic scraping)
  • Timeout errors: Increase wait time or check network
  • Blocked requests: Add delays, rotate user agents, or use proxies
  • Incomplete data: Verify pagination or scroll handling

References

CSS Selector Guide

See references/css-selectors.md for comprehensive selector examples.

Common Website Patterns

See references/website-patterns.md for common HTML structures and selectors.

Files

4 total
