{"skill":{"slug":"sea2049-scrapling-skill","displayName":"scrapling-skill","summary":"Use this skill whenever the user asks to scrape a website, extract structured data from web pages, handle anti-bot/Cloudflare pages, crawl multiple pages, or...","description":"---\r\nname: scrapling\r\ndescription: Use this skill whenever the user asks to scrape a website, extract structured data from web pages, handle anti-bot/Cloudflare pages, crawl multiple pages, or explicitly mentions Scrapling. This skill provides a practical Scrapling workflow (install, fetcher selection, extraction, and crawl patterns) for reliable Python web scraping.\r\n---\r\n\r\n# Scrapling Web Scraping Skill\r\n\r\n## Goal\r\n\r\nUse Scrapling to extract web data with minimal selector breakage and better anti-bot resilience.\r\n\r\nPrefer this skill when users ask for:\r\n\r\n- website scraping\r\n- data extraction from HTML pages\r\n- Cloudflare/anti-bot resistant scraping\r\n- multi-page crawling\r\n- converting scraping tasks into reusable Python scripts\r\n\r\n## Safety and Legality\r\n\r\nBefore scraping, always:\r\n\r\n1. Confirm the target is allowed by user intent and local laws.\r\n2. Avoid unauthorized access, login bypass, or private data scraping.\r\n3. Respect target website terms and reasonable request rates.\r\n4. 
For high-volume jobs, add delays and domain-level throttling.\r\n\r\n## Default Environment (this machine)\r\n\r\nAll dependencies should live under `D:\\clawtest`.\r\n\r\nRecommended setup commands:\r\n\r\n```powershell\r\npython -m venv D:\\clawtest\\.venv\r\nD:\\clawtest\\.venv\\Scripts\\python -m pip install -U pip\r\nD:\\clawtest\\.venv\\Scripts\\python -m pip install \"scrapling[fetchers]\"\r\nD:\\clawtest\\.venv\\Scripts\\scrapling install\r\n```\r\n\r\nNotes:\r\n\r\n- If the task is simple static HTML extraction, `pip install scrapling` is enough.\r\n- `scrapling install` is needed for browser-based fetchers.\r\n\r\n## Fetcher Selection Guide\r\n\r\nChoose the lightest option that works:\r\n\r\n1. `Fetcher`:\r\n   - Best for static pages and speed.\r\n2. `StealthyFetcher`:\r\n   - Best default when anti-bot checks likely exist.\r\n3. `DynamicFetcher`:\r\n   - Use when data is rendered by JavaScript.\r\n4. `Spider`:\r\n   - Use for multi-page crawl, queueing, concurrency, and structured export.\r\n\r\n## Standard Workflow\r\n\r\n1. Identify target fields and output schema first.\r\n2. Pick fetcher (`Fetcher` -> `StealthyFetcher` -> `DynamicFetcher` escalation).\r\n3. Extract with CSS/XPath and normalize into JSON-friendly fields.\r\n4. Save data to JSON/JSONL/CSV.\r\n5. 
Add retries, timeouts, and polite delays for production.\r\n\r\n## Code Templates\r\n\r\n### 1) Single Page Extraction (Stealthy default)\r\n\r\n```python\r\nfrom scrapling.fetchers import StealthyFetcher\r\n\r\nStealthyFetcher.adaptive = True\r\nurl = \"https://example.com/products\"\r\n# timeout is given in milliseconds (45 s here)\r\npage = StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=45000)\r\n\r\nitems = []\r\nfor card in page.css(\".product-card\", auto_save=True):\r\n    items.append({\r\n        \"title\": card.css(\"h2::text\").get(default=\"\").strip(),\r\n        \"price\": card.css(\".price::text\").get(default=\"\").strip(),\r\n        \"url\": card.css(\"a::attr(href)\").get(default=\"\")\r\n    })\r\n\r\nprint(items)\r\n```\r\n\r\n### 2) Adaptive Re-location for changed layouts\r\n\r\n```python\r\n# Assumes `page` was fetched as in template 1 (with StealthyFetcher.adaptive = True).\r\n\r\n# First run stores fingerprints:\r\nproducts = page.css(\".product-card\", auto_save=True)\r\n\r\n# Future run can recover after layout drift:\r\nproducts = page.css(\".product-card\", adaptive=True)\r\n```\r\n\r\n### 3) Spider Crawl Skeleton\r\n\r\n```python\r\nfrom scrapling.spiders import Spider, Response\r\n\r\nclass ProductSpider(Spider):\r\n    name = \"product_spider\"\r\n    start_urls = [\"https://example.com/catalog\"]\r\n\r\n    async def parse(self, response: Response):\r\n        # Emit one record per product card on the current page.\r\n        for card in response.css(\".product-card\"):\r\n            yield {\r\n                \"title\": card.css(\"h2::text\").get(default=\"\").strip(),\r\n                \"price\": card.css(\".price::text\").get(default=\"\").strip(),\r\n            }\r\n\r\n        # Queue pagination links and parse them with this same callback.\r\n        for href in response.css(\"a.next::attr(href)\").all():\r\n            yield response.follow(href, callback=self.parse)\r\n\r\nif __name__ == \"__main__\":\r\n    ProductSpider().start()\r\n```\r\n\r\n## Expected Assistant Output Format\r\n\r\nWhen executing a user task with this skill, respond with:\r\n\r\n1. chosen fetcher/spider strategy and why\r\n2. runnable script (or patch) tailored to target site\r\n3. 
exact install/run commands for current machine\r\n4. output path and data schema\r\n5. anti-bot reliability notes and fallback plan\r\n\r\n## Practical Fallback Order\r\n\r\nIf extraction fails:\r\n\r\n1. Validate selectors on fresh HTML.\r\n2. Switch `Fetcher` -> `StealthyFetcher`.\r\n3. Switch to `DynamicFetcher` for JS-rendered content.\r\n4. Add adaptive selectors (`auto_save=True` then `adaptive=True`).\r\n5. Add retries, backoff, and lower request rate.\r\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":656,"installsAllTime":25,"installsCurrent":1,"stars":1,"versions":1},"createdAt":1772871335978,"updatedAt":1778491761092},"latestVersion":{"version":"1.0.0","createdAt":1772871335978,"changelog":"Initial public release of Scrapling skill.","license":null},"metadata":null,"owner":{"handle":"sea2049","userId":"s1764rp6tjx6p94bfr0mnzq5nd884wdg","displayName":"Sea2049","image":"https://avatars.githubusercontent.com/u/158456380?v=4"},"moderation":null}