{"skill":{"slug":"youtube-scrapper","displayName":"Youtube Scrapper","summary":"A skill for discovering and scraping YouTube channels based on categories and locations without requiring API keys or login.","description":"# YouTube Channel Scraper\n\nA browser-based YouTube channel discovery and scraping tool.\n\n> Part of **[ScrapeClaw](https://www.scrapeclaw.cc/)** — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.\n\n```yaml\n---\nname: youtube-scrapper\ndescription: Discover and scrape YouTube channels from your browser.\nemoji: 📺\nversion: 1.0.2\nauthor: influenza\ntags:\n  - youtube\n  - scraping\n  - social-media\n  - channel-discovery\n  - influencer-discovery\nmetadata:\n  clawdbot:\n    requires:\n      bins:\n        - python3\n        - chromium\n\n    config:\n      stateDirs:\n        - data/output\n        - data/queue\n        - thumbnails\n      outputFormats:\n        - json\n        - csv\n---\n```\n\n## Overview\n\nThis skill provides a two-phase YouTube scraping system:\n\n1. **Channel Discovery** — Find YouTube channels via Google Search (browser-based, no API key required)\n2. 
**Browser Scraping** — Scrape public channel data using Playwright with anti-detection (no login required)\n\n## Features\n\n- 🔍 Discover YouTube channels by location and category\n- 🌐 Full browser simulation for accurate scraping\n- 🛡️ Browser fingerprinting, human behavior simulation, and stealth scripts\n- 📊 Channel info, subscribers, views, videos, engagement data, and media\n- 💾 JSON export with downloaded thumbnails\n- 🔄 Resume interrupted scraping sessions\n- ⚡ Auto-skip unavailable channels and low-subscriber profiles\n- 🌍 Built-in residential proxy support with 4 providers\n- 🗺️ Regional configs for US, UK, Europe, India, Gulf, and East Asia\n\n## Usage\n\n### Agent Tool Interface\n\nFor OpenClaw agent integration, the skill provides JSON output:\n\n```bash\n# Discover YouTube channels (returns JSON queue)\npython scripts/youtube_channel_discovery.py --categories tech --locations India\n\n# Scrape from a queue file\npython scripts/youtube_channel_scraper.py --queue data/queue/your_queue_file.json\n\n# Full orchestration — discover + scrape in one go\npython scripts/youtube_orchestrator.py --config resources/scraper_config_ind.json\n```\n\n## Output Data\n\n### Channel Data Structure\n\n```json\n{\n  \"channel_name\": \"Marques Brownlee\",\n  \"channel_url\": \"https://www.youtube.com/@mkbhd\",\n  \"subscribers\": 19200000,\n  \"total_views\": 4500000000,\n  \"video_count\": 1800,\n  \"description\": \"MKBHD: Quality Tech Videos...\",\n  \"joined_date\": \"Mar 21, 2008\",\n  \"country\": \"United States\",\n  \"profile_pic_url\": \"https://...\",\n  \"profile_pic_local\": \"thumbnails/mkbhd/profile_abc123.jpg\",\n  \"banner_url\": \"https://...\",\n  \"banner_local\": \"thumbnails/mkbhd/banner_def456.jpg\",\n  \"influencer_tier\": \"mega\",\n  \"category\": \"tech\",\n  \"scrape_location\": \"New York\",\n  \"scraped_at\": \"2026-02-17T12:00:00\",\n  \"recent_videos\": [\n    {\n      \"title\": \"Galaxy S26 Ultra Review\",\n      
\"url\": \"https://www.youtube.com/watch?v=...\",\n      \"views\": 5200000,\n      \"published\": \"2 days ago\",\n      \"duration\": \"14:32\",\n      \"thumbnail_url\": \"https://...\",\n      \"thumbnail_local\": \"thumbnails/mkbhd/video_0_ghi789.jpg\"\n    }\n  ]\n}\n```\n\n### Queue File Structure\n\n```json\n{\n  \"location\": \"India\",\n  \"category\": \"tech\",\n  \"total\": 20,\n  \"channels\": [\"@channel1\", \"@channel2\", \"...\"],\n  \"completed\": [\"@channel1\"],\n  \"failed\": {\"@channel3\": \"not_found\"},\n  \"current_index\": 2,\n  \"created_at\": \"2026-02-17T12:00:00\",\n  \"source\": \"google_search\"\n}\n```\n\n### Influencer Tiers\n\n| Tier  | Subscribers Range   |\n|-------|---------------------|\n| nano  | < 1,000             |\n| micro | 1,000 – 10,000      |\n| mid   | 10,000 – 100,000    |\n| macro | 100,000 – 1M        |\n| mega  | > 1,000,000         |\n\n### File Outputs\n\n- **Queue files**: `data/queue/{region}/{location}_{category}_{timestamp}.json`\n- **Scraped data**: `data/output_{region}/{channel_name}.json`\n- **Thumbnails**: `thumbnails_{region}/{channel}/profile_*.jpg`, `thumbnails_{region}/{channel}/video_*.jpg`\n- **Progress**: `data/progress/discovery_progress_{region}.json`\n\n## Configuration\n\nRegional config files live in `resources/`:\n\n```\nresources/scraper_config_us.json\nresources/scraper_config_uk.json\nresources/scraper_config_eur.json\nresources/scraper_config_ind.json\nresources/scraper_config_gulf.json\nresources/scraper_config_east.json\n```\n\nExample config (`resources/scraper_config_ind.json`):\n\n```json\n{\n  \"proxy\": {\n    \"enabled\": false,\n    \"provider\": \"brightdata\",\n    \"country\": \"\",\n    \"sticky\": true,\n    \"sticky_ttl_minutes\": 10\n  },\n  \"categories\": [\n    \"gaming\", \"tech\", \"beauty\", \"fashion\", \"fitness\",\n    \"food\", \"travel\", \"music\", \"education\", \"comedy\",\n    \"lifestyle\", \"cooking\", \"diy\", \"art\", \"finance\",\n    \"health\", 
\"entertainment\"\n  ],\n  \"locations\": [\n    \"India\", \"Mumbai\", \"Delhi\", \"Bangalore\", \"Hyderabad\",\n    \"Chennai\", \"Kolkata\", \"Pune\", \"Ahmedabad\", \"Jaipur\"\n  ],\n  \"max_videos_to_scrape\": 6,\n  \"headless\": false,\n  \"results_per_search\": 20,\n  \"search_delay\": [3, 7],\n  \"scrape_delay\": [2, 5],\n  \"rate_limit_wait\": 60,\n  \"max_retries\": 3\n}\n```\n\n## Filters Applied\n\nThe scraper automatically filters out:\n\n- ❌ Unavailable or terminated channels\n- ❌ Channels with < 500 subscribers (configurable)\n- ❌ Non-existent channel URLs\n- ❌ Already scraped entries (deduplication)\n- ❌ Rate-limited requests (auto-retry with backoff)\n\n## Anti-Detection\n\nThe scraper uses multiple anti-detection techniques:\n\n- **Browser fingerprinting** — Rotating fingerprint profiles (viewport, user agent, timezone, WebGL, etc.)\n- **Stealth JavaScript** — Hides `navigator.webdriver`, spoofs plugins/languages/hardware, canvas noise, fake `chrome` object\n- **Human behavior simulation** — Random delays, mouse movements, scrolling patterns\n- **Network randomization** — Variable timing between requests\n- **Request interception** — Blocks known fingerprinting and tracking scripts\n\n## Troubleshooting\n\n### No Channels Discovered\n\n- Try different location/category combinations\n- Check if Google Search is returning CAPTCHA pages\n- Run with `--headless false` to debug visually\n\n### Rate Limiting\n\n- Reduce scraping speed (increase delays in config)\n- Run during off-peak hours\n- **Use a residential proxy** (see below)\n\n### Browser Crashes\n\n- The orchestrator auto-restarts the browser every 50 channels\n- Interrupted scrapes can be resumed — queue files track progress automatically\n\n---\n\n## 🌐 Residential Proxy Support\n\n### Why Use a Residential Proxy?\n\nRunning a scraper at scale **without** a residential proxy will get your IP blocked fast. 
Here's why proxies are essential for long-running scrapes:\n\n| Advantage | Description |\n|-----------|-------------|\n| **Avoid IP Bans** | Residential IPs look like real household users, not data-center bots. YouTube is far less likely to flag them. |\n| **Automatic IP Rotation** | Each request (or session) gets a fresh IP, so rate-limits never stack up on one address. |\n| **Geo-Targeting** | Route traffic through a specific country/city so scraped content matches the target audience's locale. |\n| **Sticky Sessions** | Keep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a consistent browsing session. |\n| **Higher Success Rate** | Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on YouTube. |\n| **Long-Running Scrapes** | Scrape thousands of channels over hours or days without interruption. |\n| **Concurrent Scraping** | Run multiple browser instances across different IPs simultaneously. |\n\n### Recommended Proxy Providers\n\nWe have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:\n\n| Provider | Best For | Sign Up |\n|----------|----------|---------|\n| **Bright Data** | World's largest network, 72M+ IPs, enterprise-grade | 👉 [**Get Bright Data**](https://get.brightdata.com/o1kpd2da8iv4) |\n| **IProyal** | Pay-as-you-go, 195+ countries, no traffic expiry | 👉 [**Get IProyal**](https://iproyal.com/?r=ScrapeClaw) |\n| **Storm Proxies** | Fast & reliable, developer-friendly API, competitive pricing | 👉 [**Get Storm Proxies**](https://stormproxies.com/clients/aff/go/scrapeclaw) |\n| **NetNut** | ISP-grade network, 52M+ IPs, direct connectivity | 👉 [**Get NetNut**](https://netnut.io?ref=mwrlzwv) |\n\n### Setup Steps\n\n#### 1. 
Get Your Proxy Credentials\n\nSign up with any provider above, then grab:\n- **Username** (from your provider dashboard)\n- **Password** (from your provider dashboard)\n- **Host** and **Port** are pre-configured per provider (or use custom)\n\n#### 2. Configure via Environment Variables\n\n```bash\nexport PROXY_ENABLED=true\nexport PROXY_PROVIDER=brightdata    # brightdata | iproyal | stormproxies | netnut | custom\nexport PROXY_USERNAME=your_user\nexport PROXY_PASSWORD=your_pass\nexport PROXY_COUNTRY=us             # optional: two-letter country code\nexport PROXY_STICKY=true            # optional: keep same IP per session\n```\n\n#### 3. Provider-Specific Host/Port Defaults\n\nThese are auto-configured when you set the `provider` name:\n\n| Provider | Host | Port |\n|----------|------|------|\n| Bright Data | `brd.superproxy.io` | `22225` |\n| IProyal | `proxy.iproyal.com` | `12321` |\n| Storm Proxies | `rotating.stormproxies.com` | `9999` |\n| NetNut | `gw-resi.netnut.io` | `5959` |\n\nOverride with `PROXY_HOST` / `PROXY_PORT` env vars if your plan uses a different gateway.\n\n#### 4. 
Custom Proxy Provider\n\nFor any other proxy service, set provider to `custom` and supply host/port manually:\n\n```json\n{\n  \"proxy\": {\n    \"enabled\": true,\n    \"provider\": \"custom\",\n    \"host\": \"your.proxy.host\",\n    \"port\": 8080,\n    \"username\": \"user\",\n    \"password\": \"pass\"\n  }\n}\n```\n\n### Running the Scraper with Proxy\n\nOnce configured, the scraper picks up the proxy automatically — no extra flags needed:\n\n```bash\n# Discover and scrape as usual — proxy is applied automatically\npython scripts/youtube_orchestrator.py --config resources/scraper_config_ind.json\n\n# The log will confirm proxy is active:\n# INFO - Proxy enabled: <ProxyManager provider=brightdata enabled host=brd.superproxy.io:22225>\n# INFO - Browser using proxy: brightdata → brd.superproxy.io:22225\n```\n\n### Using the Proxy Manager Programmatically\n\n```python\nfrom proxy_manager import ProxyManager\n\n# From config (auto-reads config from resources/)\npm = ProxyManager.from_config()\n\n# From environment variables\npm = ProxyManager.from_env()\n\n# Manual construction\npm = ProxyManager(\n    provider=\"brightdata\",\n    username=\"your_user\",\n    password=\"your_pass\",\n    country=\"us\",\n    sticky=True\n)\n\n# For Playwright browser context\nproxy = pm.get_playwright_proxy()\n# → {\"server\": \"http://brd.superproxy.io:22225\", \"username\": \"user-country-us-session-abc123\", \"password\": \"pass\"}\n\n# For requests / aiohttp\nproxies = pm.get_requests_proxy()\n# → {\"http\": \"http://user:pass@host:port\", \"https\": \"http://user:pass@host:port\"}\n\n# Force new IP (rotates session ID)\npm.rotate_session()\n\n# Debug info\nprint(pm.info())\n```\n\n### Best Practices for Long-Running Scrapes\n\n1. **Use sticky sessions** — YouTube requires consistent IPs during a browsing session. Set `\"sticky\": true`.\n2. **Target the right country** — Set `\"country\": \"us\"` (or your target region) so YouTube serves content in the expected locale.\n3. 
**Combine with existing anti-detection** — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.\n4. **Rotate sessions between batches** — Call `pm.rotate_session()` between large batches of channels to get a fresh IP.\n5. **Use delays** — Even with proxies, respect `scrape_delay` in config (default 2-5s) to avoid aggressive patterns.\n6. **Monitor your proxy dashboard** — All providers have dashboards showing bandwidth usage and success rates.\n\n## Notes\n\n- **No login required** — Only scrapes publicly visible content\n- **Checkpoint/resume** — Queue files track progress; interrupted scrapes can be resumed automatically\n- **Rate limiting** — Waits 60s on rate limit, exponential backoff on consecutive failures\n- **Resilient orchestration** — Auto-restarts browser, retries failed channels, graceful shutdown on SIGINT/SIGTERM\n- **Regional configs** — Pre-built configs for 6 regions covering 200+ cities worldwide\n","tags":{"latest":"0.1.1"},"stats":{"comments":0,"downloads":1484,"installsAllTime":0,"installsCurrent":0,"stars":2,"versions":2},"createdAt":1770904773809,"updatedAt":1778990074864},"latestVersion":{"version":"0.1.1","createdAt":1772106039809,"changelog":"- Major documentation update: SKILL.md completely rewritten and expanded.\n- Clearer overview of capabilities, features, configuration, and anti-detection strategies.\n- Structured usage examples and JSON output samples for easy agent integration.\n- Enhanced description of residential proxy support with recommendations.\n- README.md removed (consolidated into SKILL.md).","license":null},"metadata":null,"owner":{"handle":"arulmozhiv","userId":"s17d6p47jmzs92ja3p6wh80wx983h93g","displayName":"Arulmozhi Vajjiravelu","image":"https://avatars.githubusercontent.com/u/58811343?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: 
review.llm_review","engineVersion":"v2.4.24","updatedAt":1779972574264}}