{"skill":{"slug":"python-web-scraper","displayName":"Web Scraper","summary":"Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J...","description":"---\nname: web-scraper\ndescription: Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (JSON/CSV). Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, scrape JavaScript-rendered content, or process scraped data into usable formats.\n---\n\n# Web Scraper\n\n## Overview\n\nPython web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JS-heavy sites, and structured output. Covers ethical scraping practices. Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, or scrape JavaScript-rendered content.\n\n## Quick Start\n\n### Prerequisites\n```bash\npip install requests beautifulsoup4 lxml\n# For JS-heavy sites:\npip install selenium webdriver-manager\n```\n\n### Basic scrape\n```bash\n# Extract all links from a page\npython3 scripts/scrape-basic.py https://example.com \\\n  --selector \"a[href]\" --attr href --output links.json --pretty\n\n# Extract text from articles\npython3 scripts/scrape-basic.py https://news.ycombinator.com \\\n  --selector \".titleline a\" --output hn.txt\n```\n\n### Paginated scrape\n```bash\n# URL parameter pagination (?page=1, ?page=2)\npython3 scripts/scrape-pagination.py https://books.toscrape.com/catalogue/page-1.html \\\n  --selector \"h3 a\" --attr title --max-pages 5\n\n# Next-link detection\npython3 scripts/scrape-pagination.py https://quotes.toscrape.com \\\n  --selector \"span.text\" --max-pages 3\n```\n\n### JavaScript-rendered pages (Selenium)\n```bash\npython3 scripts/scrape-with-selenium.py https://example.com \\\n  --selector 
\".dynamic-content\" --wait 5 --output data.json\n```\n\n## Common Scenarios\n\n### Anti-blocking techniques\n\nRotate User-Agents and add random delays to reduce the chance of HTTP 429 responses or IP blocks:\n\n```python\nimport random\nimport time\n# USER_AGENTS is assumed to be a list of User-Agent strings defined elsewhere\nheaders = {\n    \"User-Agent\": random.choice(USER_AGENTS),\n    \"Accept\": \"text/html,application/xhtml+xml\",\n    \"Accept-Language\": \"en-US,en;q=0.9\",\n    \"Referer\": \"https://www.google.com/\",\n}\ntime.sleep(random.uniform(1.0, 3.0))  # random delay between requests\n```\n\nFor aggressive blocking: set cookies, reuse a session, or route requests through a proxy.\n\n### Handle JavaScript sites without Selenium\n\nFirst check: is the data embedded in the page source?\n```python\nimport re, json\n# html is assumed to hold the page source (e.g. response.text)\n# Look for JSON data in <script> tags; re.DOTALL lets the match span multiple lines\nmatch = re.search(r'window\\.__INITIAL_STATE__\\s*=\\s*({.*?});', html, re.DOTALL)\nif match:\n    data = json.loads(match.group(1))\n```\n\nMany SPAs (React/Vue) embed their data in script tags, so Selenium may be unnecessary.\n\n### Handle login-protected pages\n\n```bash\n# Option 1: Export cookies from browser\n# In browser console: document.cookie or use EditThisCookie extension\n# Option 2: Use requests Session\npython3 -c \"\nimport json, requests\ns = requests.Session()\n# Form field names ('user', 'pass') are placeholders; match the site's login form\ns.post('https://example.com/login', data={'user': '...', 'pass': '...'})\n# Save cookies as JSON so they can be reloaded with json.load\nwith open('cookies.txt', 'w') as f:\n    json.dump(s.cookies.get_dict(), f)\n\"\n```\n\n### Output formatting\n\nScripts output JSON by default. 
Convert to CSV:\n```bash\n# JSON → CSV with Python's json and csv modules\npython3 scripts/scrape-basic.py https://example.com --selector \"tr\" --output data.json --pretty\npython3 -c \"\nimport json, csv\nwith open('data.json') as f:\n    data = json.load(f)\nwith open('data.csv', 'w', newline='') as f:\n    w = csv.writer(f)\n    w.writerow(['item'])\n    for d in data:\n        w.writerow([d])\n\"\n```\n\n## Ethics & Legal\n\n- Always check `robots.txt` first: `https://example.com/robots.txt`\n- Respect the `Crawl-delay` directive\n- Identify yourself in the User-Agent with contact info\n- Never scrape login-protected content, personal data, or copyrighted material\n- Add delays (1-3s minimum) between requests so you don't hammer servers\n- Check the ToS; some sites explicitly ban scraping\n- For public data (news, blogs, directories): generally fine with proper rate limiting\n\n## Resources\n\n- **`scripts/scrape-basic.py`** — Single page scrape with CSS selectors, JSON/CSV/text output\n- **`scripts/scrape-pagination.py`** — Paginated scrape (URL params + next-link detection)\n- **`scripts/scrape-with-selenium.py`** — Selenium-based scrape for JS-heavy sites with scroll\n- **`references/anti-blocking.md`** — Detailed anti-blocking and proxy strategies\n","tags":{"latest":"1.0.1"},"stats":{"comments":0,"downloads":378,"installsAllTime":0,"installsCurrent":0,"stars":0,"versions":2},"createdAt":1778202412087,"updatedAt":1779076249834},"latestVersion":{"version":"1.0.1","createdAt":1778206761891,"changelog":"Fix: --output - now correctly prints to stdout instead of creating a '-' file","license":"MIT-0"},"metadata":null,"owner":{"handle":"ericlooi504","userId":"s1728b3jrtnnagbdxjy2rmpndh84mk8e","displayName":"ericlooi504","image":"https://avatars.githubusercontent.com/u/275256771?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780090759543}}