Install
openclaw skills install yinan-web-scraper
Extract structured data from websites using browser automation. Use when scraping product listings, articles, contact info, prices, or any web content. Supports single pages, pagination, infinite scroll, and dynamic content. Outputs to CSV, JSON, or Excel.
Professional web scraping skill using agent-browser. Extract structured data from any website with support for JavaScript-rendered content, pagination, and complex selectors.
python scripts/scrape_page.py \
--url "https://example.com/products" \
--fields "title=h2.title,price=.price,link=a.href" \
--output products.csv
python scripts/scrape_paginated.py \
--url "https://example.com/products?page={page}" \
--pages 10 \
--fields "title=h1.product-title,price=.price-tag,description=.product-description" \
--output all_products.csv
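As a rough illustration of the {page} placeholder used in the command above, the sketch below expands a URL pattern into one URL per page. The function name `expand_page_urls` is hypothetical, and starting the count at page 1 is an assumption; the skill's own script may number pages differently.

```python
def expand_page_urls(pattern: str, pages: int, start: int = 1) -> list:
    """Return one URL per page by substituting the {page} placeholder.

    Hypothetical helper: starting at page 1 is an assumption, not
    something the skill documents.
    """
    if "{page}" not in pattern:
        raise ValueError("URL pattern must contain a {page} placeholder")
    # str.replace is used instead of str.format so that any other braces
    # in the URL are left untouched.
    return [pattern.replace("{page}", str(n)) for n in range(start, start + pages)]


urls = expand_page_urls("https://example.com/products?page={page}", pages=10)
print(urls[0])
print(urls[-1])
```

Each URL in the resulting list would then be fetched in turn, with a delay between requests.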
Scrape a single page or static list with scripts/scrape_page.py.
Arguments:
--url - Target URL
--fields - Field definitions (name=selector format, comma-separated)
--output - Output file (CSV, JSON, or XLSX)
--format - Output format (csv, json, xlsx)
--wait - Wait time for dynamic content (seconds)
Field Definition Format:
fieldname=css_selector
Examples:
title=h1.product-title
price=.price-tag
description=.product-description
image=img.product-image.src
link=a.product-link.href
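The examples above suggest that a trailing .src or .href names an attribute to read rather than a CSS class. The sketch below shows one way such a field string could be parsed under that assumption. `parse_fields` and `KNOWN_ATTRIBUTES` are hypothetical names, and the skill's real parser may behave differently.

```python
# Attribute suffixes recognized by this sketch. This set is an assumption
# based on the examples, not the skill's actual list.
KNOWN_ATTRIBUTES = {"href", "src"}


def parse_fields(spec: str) -> dict:
    """Map each field name to (css_selector, attribute), attribute=None for text."""
    fields = {}
    for item in spec.split(","):
        name, sep, selector = item.partition("=")
        name, selector = name.strip(), selector.strip()
        if not sep or not name or not selector:
            raise ValueError(f"expected name=selector, got {item!r}")
        # Split off the last dotted segment and check whether it is a
        # known attribute name rather than a CSS class.
        head, dot, tail = selector.rpartition(".")
        if dot and head and tail in KNOWN_ATTRIBUTES:
            fields[name] = (head, tail)
        else:
            fields[name] = (selector, None)
    return fields


parsed = parse_fields("title=h1.product-title,image=img.product-image.src,link=a.href")
print(parsed)
```

Because the sketch splits on commas, it cannot handle selectors that themselves contain a comma (selector groups); a real parser would need an escaping rule for that case.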
Scrape multiple pages with pagination using scripts/scrape_paginated.py.
Arguments:
--url - URL pattern (use {page} for page number)
--pages - Number of pages to scrape
--fields - Field definitions
--output - Output file
--delay - Delay between pages (seconds)
--next-selector - CSS selector for "next page" button (alternative to URL pattern)
Scrape pages with infinite scroll loading using scripts/scrape_infinite_scroll.py.
Arguments:
--url - Target URL
--scrolls - Number of scroll actions
--fields - Field definitions
--output - Output file
--scroll-delay - Delay between scrolls (ms)
Scrape JavaScript-heavy sites with custom interactions.
Arguments:
--url - Target URL
--actions - JSON file with interaction sequence
--fields - Field definitions
--output - Output file
{
  "actions": [
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "ms": 2000},
    {"type": "scroll", "direction": "down", "pixels": 500},
    {"type": "fill", "selector": "#search", "value": "keyword"},
    {"type": "press", "key": "Enter"}
  ]
}
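Before handing an actions file to the scraper, it can help to check that each step carries the keys its type needs. The sketch below does this for the five action types shown above. The required-key table is inferred from that one example only, and `validate_actions` is a hypothetical helper rather than part of the skill.

```python
import json

# Required keys per action type, inferred from the example file only.
# The skill may accept other types or optional keys not listed here.
REQUIRED_KEYS = {
    "click": {"selector"},
    "wait": {"ms"},
    "scroll": {"direction", "pixels"},
    "fill": {"selector", "value"},
    "press": {"key"},
}


def validate_actions(document: dict) -> list:
    """Return the action list, raising ValueError on the first malformed step."""
    actions = document.get("actions")
    if not isinstance(actions, list):
        raise ValueError('expected a top-level "actions" list')
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"action {index}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - set(action)
        if missing:
            raise ValueError(f"action {index}: missing {sorted(missing)}")
    return actions


sample = json.loads(
    '{"actions": [{"type": "click", "selector": "#load-more"},'
    ' {"type": "wait", "ms": 2000}]}'
)
print(len(validate_actions(sample)))
```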
CSV:
title,price,link,url
"Product A",29.99,https://...,https://...
"Product B",39.99,https://...,https://...
JSON:
[
  {
    "title": "Product A",
    "price": "29.99",
    "link": "https://...",
    "scraped_at": "2026-03-07T16:00:00"
  }
]
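Both text formats above can be produced from the same list of records. The sketch below uses only the Python standard library; `to_csv` and `to_json` are hypothetical helpers and the sample records are illustrative, so the skill's actual writers may differ in quoting and column order.

```python
import csv
import io
import json

# Illustrative records only; real rows come from the scraper.
records = [
    {"title": "Product A", "price": "29.99"},
    {"title": "Product B", "price": "39.99"},
]


def to_csv(rows: list) -> str:
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: list) -> str:
    """Serialize a list of dicts to indented JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


print(to_csv(records))
print(to_json(records))
```

The csv module quotes a value only when it contains a delimiter, quote, or newline, so plain titles come out unquoted here.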
Excel (XLSX): select with --format xlsx and an .xlsx output file.
python scripts/scrape_paginated.py \
--url "https://example.com/shop?page={page}" \
--pages 5 \
--fields "name=.product-name,price=.price,rating=.stars,reviews=.review-count,url=a.href" \
--output products.csv \
--delay 3
python scripts/scrape_page.py \
--url "https://news-site.com/latest" \
--fields "headline=h2.article-title,summary=.article-summary,author=.byline,date=.publish-date,url=a.read-more.href" \
--output articles.json \
--format json
python scripts/scrape_infinite_scroll.py \
--url "https://jobs-site.com/search" \
--scrolls 10 \
--fields "title=.job-title,company=.company-name,location=.location,salary=.salary,posted=.date-posted,url=a.job-link.href" \
--output jobs.csv \
--scroll-delay 1500
python scripts/scrape_paginated.py \
--url "https://realestate.com/listings?page={page}" \
--pages 20 \
--fields "address=.property-address,price=.listing-price,beds=.bedrooms,baths=.bathrooms,sqft=.square-feet,url=a.property-link.href" \
--output listings.xlsx \
--format xlsx \
--delay 5
Some sites employ anti-scraping techniques:
| Measure | Countermeasure |
|---|---|
| IP blocking | Use proxies, rotate IPs |
| CAPTCHA | Manual solving or CAPTCHA services |
| Rate limiting | Increase delays, randomize timing |
| JavaScript challenges | Use browser automation (agent-browser) |
| Honeypot traps | Avoid hidden fields, validate selectors |
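As one illustration of the "increase delays, randomize timing" row, the sketch below adds random jitter on top of a base delay so requests do not arrive at a fixed interval. `jittered_delay` is a hypothetical helper, and the 50% default jitter is an arbitrary choice for the example, not a value taken from the skill.

```python
import random


def jittered_delay(base_seconds: float, jitter: float = 0.5) -> float:
    """Return a delay between base_seconds and base_seconds * (1 + jitter).

    Hypothetical helper; the default jitter fraction is an arbitrary choice.
    """
    if base_seconds < 0 or jitter < 0:
        raise ValueError("base_seconds and jitter must be non-negative")
    return base_seconds + random.uniform(0, base_seconds * jitter)


# With a 3-second base, each delay falls somewhere between 3 and 4.5 seconds.
delays = [jittered_delay(3) for _ in range(5)]
print(delays)
```

A scraping loop would pass each value to `time.sleep` between page requests.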
Disclaimer: This skill is for educational purposes. Users are responsible for compliance with applicable laws and website terms.
See references/css-selectors.md for comprehensive selector examples.
See references/website-patterns.md for common HTML structures and selectors.