{"skill":{"slug":"bright-data-claude-skill-deep-research","displayName":"Bright-Data-MCP-Claude-Skill-deep-research","summary":"This skill should be used when the user asks to \"research web data\", \"scrape websites\", \"extract web data\", \"perform market research\", \"analyze competitors\",...","description":"---\nname: research-brightdata\ndescription: This skill should be used when the user asks to \"research web data\", \"scrape websites\", \"extract web data\", \"perform market research\", \"analyze competitors\", \"monitor prices\", \"collect product information\", \"search and analyze web content\", or mentions Bright Data MCP, web scraping, web data extraction, or automated research. Provides comprehensive web research workflows using Bright Data MCP tools including search, scraping, extraction, and browser automation capabilities.\nversion: 1.0.0\n---\n\n# Bright Data Research Skill\n\nAdvanced web research powered by Bright Data MCP - perform market analysis, competitive intelligence, data extraction, and comprehensive web research with anti-bot protection.\n\n## Overview\n\nThis skill provides complete workflows for automated web research using Bright Data MCP. 
Handle search discovery, content collection, structured data extraction, and comprehensive analysis with browser automation support.\n\n## When This Skill Applies\n\nActivate this skill when the user's request involves:\n- Web scraping and data collection\n- Market research and competitive analysis\n- Price monitoring and comparison\n- Product information extraction\n- Search engine result analysis\n- Large-scale web data gathering\n- Research requiring anti-bot protection\n\n## Core Capabilities\n\n### Search and Discovery\n\nUse `search_engine` tool to find relevant sources:\n\n```json\n{\n  \"tool\": \"search_engine\",\n  \"parameters\": {\n    \"query\": \"site:etsy.com nba merchandise\",\n    \"engine\": \"google\",\n    \"cursor\": \"0\"\n  }\n}\n```\n\n**Search strategies:**\n- Use site operators: `\"site:etsy.com keywords\"`\n- Use exact phrases: `\"machine learning in healthcare\"`\n- Exclude terms: `\"iphone -case -cover\"`\n- Paginate with cursor: \"0\", \"1\", \"2\" for more results\n\n### Content Collection\n\nThree collection modes based on research depth:\n\n**Quick Mode** (3-5 URLs, serial processing):\n- Use `scrape_as_markdown` for each URL\n- Best for: Fast overviews, fact-checking\n\n**Standard Mode** (10-20 URLs, parallel batch):\n- Use `scrape_batch` for up to 10 URLs concurrently\n- Best for: Market research, competitive analysis\n\n**Deep Mode** (20-50 URLs, browser automation):\n- Use `scraping_browser_navigate` for JavaScript-rendered pages\n- Use `scraping_browser_links` to discover page links\n- Use `scraping_browser_click` for interactions\n- Best for: Dynamic content, multi-page extraction\n\n### Data Extraction\n\nUse `extract` tool for AI-powered structured data extraction:\n\n```json\n{\n  \"tool\": \"extract\",\n  \"parameters\": {\n    \"url\": \"https://example.com/product\",\n    \"extraction_prompt\": \"Extract: product name, price as number, rating (0-5), number of reviews, seller name, availability status\"\n  
}\n}\n```\n\n**Common extraction schemas:**\n- **E-commerce**: name, price, rating, reviews, seller, availability\n- **Articles**: title, author, date, summary, key points\n- **Companies**: name, industry, founded, headquarters, employee count\n\n### Output Formats\n\nThree report formats for different use cases:\n\n**Report Format** (default):\n- Executive summary\n- Key findings with evidence\n- Detailed analysis\n- Methodology and recommendations\n- Source references\n\n**JSON Format**:\n- Structured data for API integration\n- All raw and processed data\n- Metadata and provenance\n- Statistical analysis\n\n**Markdown Format**:\n- Clean, readable content\n- Tables and lists\n- Source links\n- Minimal formatting\n\n## Research Workflow\n\n### Phase 1: Query Analysis\n\nUnderstand the research intent:\n- **Scope**: How broad/deep should research be?\n- **Key entities**: Products, companies, topics\n- **Target sources**: Which sites/platforms?\n- **Data needed**: What fields to extract?\n\n### Phase 2: Source Discovery\n\nUse `search_engine` to find URLs:\n1. Execute initial search\n2. Extract URLs from SERP\n3. Filter irrelevant domains\n4. Paginate if needed\n5. 
Prioritize by relevance\n\n### Phase 3: Content Collection\n\nChoose appropriate mode:\n- **Quick**: `scrape_as_markdown` per URL\n- **Standard**: `scrape_batch` 10 URLs at once\n- **Deep**: `scraping_browser_navigate` + browser tools\n\nHandle errors gracefully:\n- Retry failed URLs with alternative methods\n- Log errors for transparency\n- Continue with available data\n\n### Phase 4: Data Extraction\n\nApply extraction schema:\n- Use `extract` with custom prompts\n- Validate extracted data\n- Handle missing/malformed data\n- Ensure data quality\n\n### Phase 5: Analysis & Synthesis\n\nProcess and analyze:\n- Clean and normalize data\n- Perform statistical analysis\n- Identify patterns and trends\n- Cross-reference sources\n- Validate findings\n\n### Phase 6: Report Generation\n\nGenerate output:\n- **Report**: Comprehensive document with all sections\n- **JSON**: Structured data for processing\n- **Markdown**: Clean, readable content\n\n## Best Practices\n\n### Search Strategy\n- Start broad, then narrow down\n- Use site operators for targeted searches\n- Try multiple search engines if needed\n- Set realistic limits (10-20 URLs usually sufficient)\n\n### Performance\n- Use `scrape_batch` for parallel processing (10x faster)\n- Only use `deep` mode when necessary (much slower)\n- Set appropriate timeouts\n- Monitor success rates\n- **Avoid token limits**: Batch 1-2 URLs at a time for large pages (Etsy, Amazon, etc.)\n\n### Data Quality\n- Always validate extracted data\n- Cross-reference multiple sources\n- Check for outliers and anomalies\n- Normalize formats (dates, currencies, units)\n\n### Error Handling\n- Implement retry logic\n- Have fallback strategies\n- Log errors for debugging\n- Don't fail on individual URL errors\n\n### Ethical Considerations\n- Respect robots.txt\n- Don't overwhelm servers\n- Rate limit requests\n- Cite sources properly\n- Don't misuse personal data\n\n## Common Research Scenarios\n\n### E-commerce Market Research\n\n```\nQuery: 
\"site:etsy.com nba merchandise\"\nMode: standard\nExtract: product name, price, rating, reviews, seller\nOutput: report\n```\n\nExpected: Price analysis, popular products, top sellers\n\n### Price Comparison\n\n```\nQuery: \"iphone 15 pro max 256GB price comparison\"\nMode: standard\nExtract: retailer, price, availability, shipping\nOutput: json\n```\n\nExpected: Structured comparison with best deal identified\n\n### Academic Research\n\n```\nQuery: \"machine learning in healthcare 2024 papers\"\nMode: standard\nExtract: title, authors, date, key findings, methodology\nOutput: report\n```\n\nExpected: Literature review with trends and insights\n\n### Competitive Intelligence\n\n```\nQuery: \"competitor.com features pricing\"\nMode: deep\nExtract: feature name, description, pricing tier, availability\nOutput: report\n```\n\nExpected: Feature comparison, pricing analysis, recommendations\n\n## Tool Reference\n\n### search_engine\n**Purpose**: Find relevant web pages\n**Parameters**: query (required), engine (google/bing/yandex), cursor (page number)\n**Returns**: SERP results in markdown\n\n### scrape_as_markdown\n**Purpose**: Get clean, AI-ready markdown\n**Parameters**: url (required)\n**Returns**: Formatted markdown without ads/clutter\n\n### scrape_as_html\n**Purpose**: Get raw HTML\n**Parameters**: url (required)\n**Returns**: Complete HTML document\n\n### extract\n**Purpose**: AI-powered structured data extraction\n**Parameters**: url (required), extraction_prompt (optional)\n**Returns**: JSON object with extracted data\n\n### scrape_batch\n**Purpose**: Process multiple URLs in parallel\n**Parameters**: urls (array, max 10)\n**Returns**: Array of page contents\n\n### scraping_browser_navigate\n**Purpose**: Navigate JavaScript-rendered pages\n**Parameters**: url (required)\n**Returns**: Page info (title, URL, status)\n\n### scraping_browser_click\n**Purpose**: Click elements on page\n**Parameters**: selector (CSS selector)\n**Returns**: Action result\n\n### 
scraping_browser_links\n**Purpose**: Get all links on current page\n**Parameters**: None\n**Returns**: Array of links with text, href, selector\n\n## Troubleshooting\n\n### No search results\n- Try different search engine (bing, yandex)\n- Simplify the query\n- Check for typos\n- Use broader search terms\n\n### Scraping fails\n- URL might be JavaScript-rendered → use `mode=deep`\n- URL might be blocked → try alternative URL\n- Check if URL is accessible in browser\n\n### Extraction incomplete\n- Provide more specific extraction prompt\n- Check if data exists on page\n- Try scraping as markdown first to see content\n\n### Slow performance\n- Reduce `max_results`\n- Use `mode=standard` instead of `deep`\n- Check network connectivity\n- Close unnecessary browser sessions\n\n### Token limit exceeded\n- **Symptom**: \"Output exceeds maximum allowed tokens\" error\n- **Cause**: Batch scraping too many large pages at once OR reading large files\n- **Why this limit exists**: \n  - **Memory protection**: Prevents memory overflow from loading too much content\n  - **Performance optimization**: Ensures fast response times\n  - **Context management**: Preserves space for other content in the conversation\n  - **System stability**: Prevents crashes or errors\n- **Can this limit be increased?**: \n  - **No** - This is a hard system limit in Claude Code\n  - **Cannot be changed** via configuration files\n  - **Purpose**: Protect system stability and performance\n- **Workarounds**: \n  - **For scraping**: Reduce batch size to 1-2 URLs for large pages\n  - **For reading files**: Use `Read` with `offset` and `limit` to read in chunks\n  - **For specific content**: Use `Grep` to search for specific patterns\n  - **For finding files**: Use `Glob` to find files by pattern\n\n## Additional Resources\n\n### Reference Files\n\nFor detailed workflows and techniques:\n- **`references/search-discovery.md`** - Search strategies and URL discovery\n- **`references/content-scraping.md`** - 
Content collection methods\n- **`references/data-extraction.md`** - Extraction schemas and validation\n- **`references/deep-scraping.md`** - Browser automation techniques\n- **`references/analysis-report.md`** - Analysis and report generation\n\n### Example Files\n\nComplete research examples:\n- **`examples/market-research-etsy-nba.md`** - E-commerce market research\n- **`examples/competitive-analysis-pricing.md`** - Price comparison workflow\n- **`examples/academic-research-ml-healthcare.md`** - Academic literature review\n\n## Limitations\n\n- Requires Bright Data MCP server configuration\n- Needs a valid Bright Data API token\n- Subject to API rate limits\n- Browser automation is slower than direct scraping\n- Some sites may still block access\n- Quality depends on source content\n\n## Progressive Disclosure\n\nThis SKILL.md provides core workflows and a quick reference (approximately 2,000 words).\n\nFor detailed implementation patterns, advanced techniques, and comprehensive examples, consult the `references/` files, which load as needed during research tasks.\n","topics":["Browser Automation","Data Extraction","Market Research","Web Scraping","Scrape"],"tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":592,"installsAllTime":22,"installsCurrent":1,"stars":0,"versions":1},"createdAt":1772770150954,"updatedAt":1778491745273},"latestVersion":{"version":"1.0.0","createdAt":1772770150954,"changelog":"- Initial release.","license":null},"metadata":null,"owner":{"handle":"liangdabiao","userId":"s1751q8y7zg7g4cphjx57122f183hzq4","displayName":"liangdabiao","image":"https://avatars.githubusercontent.com/u/1232260?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780089777398}}