Web Scraper as a Service
Build client-ready web scrapers with clean data output. Use when creating scrapers for clients, extracting data from websites, or delivering scraping projects.
Turns scraping briefs into deliverable projects: generates the scraper, runs it, cleans the data, and packages everything for the client.
How to Use
/web-scraper-as-a-service "Scrape all products from example-store.com — need name, price, description, images. CSV output."
/web-scraper-as-a-service https://example.com --fields "title,price,rating,url" --format csv
/web-scraper-as-a-service brief.txt
Scraper Generation Pipeline
Step 1: Analyze the Target
Before writing any code:
- Fetch the target URL to understand the page structure
- Identify:
- Is the site server-rendered (static HTML) or client-rendered (JavaScript/SPA)?
- What anti-scraping measures are visible? (Cloudflare, CAPTCHAs, rate limits)
- Pagination pattern (URL params, infinite scroll, load more button)
- Data structure (product cards, table rows, list items)
- Total estimated volume (number of pages/items)
- Choose the right tool (a quick heuristic for this choice is sketched below):
  - Static HTML → Python + requests + BeautifulSoup
  - JavaScript-rendered → Python + playwright
  - API available → Direct API calls (check network tab patterns)
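To ground the static-vs-JavaScript call, one quick heuristic (a sketch; the URL and expected text are placeholders) is to fetch the raw HTML and check whether the target data is already present without running JavaScript:

import requests

def looks_server_rendered(url: str, expected_text: str) -> bool:
    """Return True if the expected data appears in the raw HTML response."""
    resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    # Data present without executing JavaScript means requests + BeautifulSoup
    # is likely enough; otherwise reach for playwright.
    return expected_text in resp.text

# Example: looks_server_rendered("https://example-store.com", "Acme Widget")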
Step 2: Build the Scraper
Generate a complete Python script in the scraper/ directory:
scraper/
scrape.py # Main scraper script
requirements.txt # Dependencies
config.json # Target URLs, fields, settings
README.md # Setup and usage instructions for client
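As a sketch, config.json might hold something like this (the key names are illustrative, not a fixed schema; match them to whatever scrape.py actually reads):

{
  "start_url": "https://example-store.com/products",
  "fields": ["name", "price", "description", "images"],
  "output_format": "csv",
  "delay_between_requests": 2,
  "max_retries": 3,
  "retry_delay": 5
}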
scrape.py must include:
# Required features in every scraper:
# 1. Configuration
import json
with open('config.json') as f:
    config = json.load(f)
# 2. Rate limiting (ALWAYS — be respectful)
import time
DELAY_BETWEEN_REQUESTS = 2 # seconds, adjustable in config
# 3. Retry logic
MAX_RETRIES = 3
RETRY_DELAY = 5
# 4. User-Agent rotation
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
# ... at least 5 user agents
]
# 5. Progress tracking
print(f"Scraping page {current}/{total} — {items_collected} items collected")
# 6. Error handling
# - Log errors but don't crash on individual page failures
# - Save progress incrementally (don't lose data on crash)
# - Write errors to error_log.txt
# 7. Output
# - Save data incrementally (append to file, don't hold in memory)
# - Support CSV and JSON output
# - Clean and normalize data before saving
# 8. Resume capability
# - Track last successfully scraped page/URL
# - Can resume from where it left off if interrupted
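A minimal skeleton wiring several of these requirements together (rate limiting, retries, User-Agent rotation, incremental CSV output) might look like the following. This is a sketch under the config.json assumptions above; parse_items and page_urls are hypothetical, site-specific pieces:

import csv
import json
import random
import time

import requests

with open("config.json") as f:
    config = json.load(f)

USER_AGENTS = config.get("user_agents", ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)"])
DELAY = config.get("delay_between_requests", 2)
MAX_RETRIES = config.get("max_retries", 3)
RETRY_DELAY = config.get("retry_delay", 5)

def fetch(url: str) -> str:
    """GET a page with retries and a rotated User-Agent; log failures."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            resp = requests.get(url, timeout=15, headers=headers)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            with open("error_log.txt", "a") as log:
                log.write(f"{url} attempt {attempt}: {exc}\n")
            if attempt == MAX_RETRIES:
                raise
            time.sleep(RETRY_DELAY)

def save_row(row: dict, path: str = "output.csv") -> None:
    """Append one record immediately so a crash loses at most one row."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=config["fields"])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(row)

# Main loop, with site-specific parsing omitted:
# for url in page_urls:
#     for item in parse_items(fetch(url)):
#         save_row(item)
#     time.sleep(DELAY)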
Step 3: Data Cleaning
After scraping, clean the data:
- Remove duplicates (by unique identifier or composite key)
- Normalize text (strip extra whitespace, fix encoding issues, consistent capitalization)
- Validate data (no empty required fields, prices are numbers, URLs are valid)
- Standardize formats (dates to ISO 8601, currency to numbers, consistent units)
- Generate data quality report:
Data Quality Report
───────────────────
Total records: 2,487
Duplicates removed: 13
Empty fields filled: 0
Fields with issues: price (3 records had non-numeric values — cleaned)
Completeness: 99.5%
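A cleaning pass along these lines can be sketched with pandas; the column names here ("sku", "name", "price") are illustrative and should be swapped for the project's actual unique key and numeric fields:

import pandas as pd

df = pd.read_csv("output.csv")
before = len(df)

df = df.drop_duplicates(subset=["sku"])  # dedupe on the unique key
df["name"] = df["name"].str.strip()      # strip stray whitespace
# Coerce "$1,299.00"-style strings to numbers; failures become NaN for review.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

print(f"Total records: {len(df)}")
print(f"Duplicates removed: {before - len(df)}")
print(f"Prices needing manual review: {df['price'].isna().sum()}")

df.to_csv("delivery/data.csv", index=False)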
Step 4: Client Deliverable Package
Generate a complete deliverable:
delivery/
data.csv # Clean data in requested format
data.json # JSON alternative
data-quality-report.md # Quality metrics
scraper-documentation.md # How the scraper works
README.md # Quick start guide
scraper-documentation.md includes:
- What was scraped and from where
- How many records collected
- Data fields and their descriptions
- How to re-run the scraper
- Known limitations
- Date of scraping
Step 5: Output to User
Present:
- Summary: X records scraped from Y pages, Z% data quality
- Sample data: First 5 rows of the output
- File locations: Where the deliverables are saved
- Client handoff notes: What to tell the client about the data
Scraper Templates
Based on the target type, use the appropriate template:
E-commerce Product Scraper
Fields: name, price, original_price, discount, description, images, category, sku, rating, review_count, availability, url
Real Estate Listings
Fields: address, price, bedrooms, bathrooms, sqft, lot_size, listing_type, agent, description, images, url
Job Listings
Fields: title, company, location, salary, job_type, description, requirements, posted_date, url
Directory/Business Listings
Fields: business_name, address, phone, website, category, rating, review_count, hours, description
News/Blog Articles
Fields: title, author, date, content, tags, url, image
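As a parsing illustration for one of these templates, a BeautifulSoup sketch for a news/blog article page might look like this; every selector is hypothetical and must be adapted to the real markup:

from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    """Extract the news/blog template fields from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    author = soup.select_one(".author")
    date = soup.select_one("time")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "author": author.get_text(strip=True) if author else None,
        "date": date["datetime"] if date else None,
        "content": " ".join(p.get_text(strip=True) for p in soup.select("article p")),
        "tags": [t.get_text(strip=True) for t in soup.select(".tags a")],
    }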
Ethical Scraping Rules
- Always respect robots.txt — check before scraping
- Rate limit — minimum 2 second delay between requests
- Identify yourself — use realistic but honest User-Agent
- Don't scrape personal data (emails, phone numbers) unless explicitly authorized by the client AND the data is publicly displayed
- Cache responses — don't re-scrape pages unnecessarily
- Check ToS — note if the site's terms prohibit scraping and inform the client
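The robots.txt check can be done with the standard library; this sketch assumes the scraper's own User-Agent string is passed in:

from urllib import robotparser
from urllib.parse import urljoin

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, url)

# if not allowed_by_robots("https://example-store.com/products"):
#     raise SystemExit("robots.txt disallows this path; inform the client.")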