Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Tavily Crawl

v1.0.0

Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or...

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for evanydl/tavily-crwal.

Install the skill "Tavily Crawl" (evanydl/tavily-crwal) from ClawHub.
Skill page: https://clawhub.ai/evanydl/tavily-crwal
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install tavily-crwal

ClawHub CLI

npx clawhub@latest install tavily-crwal

Security Scan

VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)

Purpose & Capability
The declared purpose (crawl websites and save markdown) matches the script's behavior: it builds a JSON-RPC request to Tavily's MCP and writes results to files. However, metadata claims no required binaries or env vars while the script depends on external tools (jq, curl, npx, find, base64, date). The SKILL.md and script also reference different endpoints (api.tavily.com vs mcp.tavily.com) and the SKILL.md suggests placing an API key in ~/.claude/settings.json but the script only reads TAVILY_API_KEY from environment or tokens under ~/.mcp-auth — an inconsistency.
!
Instruction Scope
The script searches your home directory (~/.mcp-auth) for *_tokens.json and will decode/read JWT payloads there, then uses any valid Tavily token it finds. That is within the stated Tavily OAuth convenience flow, but it reads files from your user profile and may surface tokens automatically. The script also invokes npx to run 'mcp-remote' (backgrounded and silenced), which executes remote code outside the explicit skill scope. SKILL.md states 'No manual setup required' — true functionally, but the opaque npx invocation and home-directory token reads widen data access beyond 'just provide a URL'.
!
Install Mechanism
There is no install spec in the registry (instruction-only), but the script uses 'npx -y mcp-remote ...' at runtime. npx will fetch and execute code from the npm registry on-demand, which is effectively downloading and running third-party code without an explicit install step or review. That runtime fetch is a higher-risk operation compared to purely local scripts.
Credentials
The skill declares no required env vars, but the script expects TAVILY_API_KEY (or an OAuth token found in ~/.mcp-auth). Looking for tokens in ~/.mcp-auth is explainable for an OAuth convenience flow; the script also enforces an issuer check in JWTs (iss == 'https://mcp.tavily.com/'), which limits false positives. Still, the documentation's suggested location (~/.claude/settings.json) is not read by the script, and the script will instead search your auth cache and environment — this mismatch should be clarified before use.
Persistence & Privilege
The skill is not force-included (always:false) and does not change other skills or system-wide settings. It does spawn a temporary background npx process for OAuth flow, but it does not persistently install software or modify other agent configs.
What to consider before installing
Before installing or running this skill consider:

  1. The script will attempt to read OAuth tokens from ~/.mcp-auth and will automatically use any valid Tavily token it finds — if you have other sensitive tokens in that directory, be cautious.
  2. It runs 'npx -y mcp-remote ...' which downloads and executes an npm package at runtime (silently). If you don't trust the package or tavily.com, run the script in a sandbox/VM and audit the mcp-remote package first.
  3. The script requires common CLI tools (jq, curl, npx, base64, find, date) though the skill metadata omits these — make sure those are present and you understand what they do.
  4. If you prefer explicit control, set TAVILY_API_KEY in your environment before running (the SKILL.md suggests ~/.claude/settings.json but the script uses the env var or ~/.mcp-auth).
  5. If you have sensitive files or tokens in your home directory, inspect ~/.mcp-auth and remove or isolate them before running.

If these caveats are acceptable and you trust Tavily and the npm package used, the skill is coherent with its stated purpose; otherwise treat it as risky and run only in an isolated environment.
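
One way to carry out the ~/.mcp-auth inspection recommended above, as an added example that is not part of the skill or of the scan report: it lists each *_tokens.json with its JWT issuer and a status, without returning the token itself. The access_token field name, treating "valid" as "not expired", and the sample files are assumptions.

```python
import base64
import json
import tempfile
import time
from pathlib import Path

TAVILY_ISSUER = "https://mcp.tavily.com/"  # issuer value quoted in the scan report

def jwt_claims(token):
    """Decode a JWT payload without verifying it; None if the value is not a JWT."""
    try:
        body = token.split(".")[1]
        claims = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    except (AttributeError, IndexError, ValueError):
        return None
    return claims if isinstance(claims, dict) else None

def audit_token_cache(auth_dir, now=None):
    """Report issuer and status of each *_tokens.json; token values are never returned."""
    now = time.time() if now is None else now
    findings = []
    for path in sorted(Path(auth_dir).rglob("*_tokens.json")):
        try:  # assumption: the cache file keeps the JWT under "access_token"
            token = json.loads(path.read_text()).get("access_token")
        except (OSError, ValueError, AttributeError):
            token = None
        claims = jwt_claims(token)
        if claims is None:
            findings.append((path.name, None, "no readable JWT"))
            continue
        expired = isinstance(claims.get("exp"), (int, float)) and claims["exp"] <= now
        mine = claims.get("iss") == TAVILY_ISSUER
        status = ("Tavily, expired" if expired else "Tavily, usable") if mine else "other issuer"
        findings.append((path.name, claims.get("iss"), status))
    return findings

def fake_jwt(claims):  # constructed, unsigned sample token used only by demo()
    enc = lambda o: base64.urlsafe_b64encode(json.dumps(o).encode()).rstrip(b"=").decode()
    return f"{enc({'alg': 'none'})}.{enc(claims)}.sig"

def demo():  # audits a throwaway directory holding two constructed token files
    samples = {"aaa_tokens.json": {"iss": TAVILY_ISSUER, "exp": 2_000_000_000},
               "bbb_tokens.json": {"iss": "https://idp.example/", "exp": 2_000_000_000}}
    with tempfile.TemporaryDirectory() as tmp:
        for name, claims in samples.items():
            Path(tmp, name).write_text(json.dumps({"access_token": fake_jwt(claims)}))
        return audit_token_cache(tmp, now=1_700_000_000)

if __name__ == "__main__":
    print(demo())
```

To audit a real cache, call audit_token_cache(Path.home() / ".mcp-auth"); files reported as "other issuer" are the ones to move out of the directory before running the skill.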

Like a lobster shell, security has layers — review code before you run it.

v1.0.0
MIT-0

Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Authentication

The script uses OAuth via the Tavily MCP server. No manual setup required - on first run, it will:

  1. Check for existing tokens in ~/.mcp-auth/
  2. If none found, automatically open your browser for OAuth authentication

Note: You must have an existing Tavily account. The OAuth flow only supports login — account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.

Alternative: API Key

If you prefer using an API key, get one at https://tavily.com and add it to ~/.claude/settings.json:

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}

Quick Start

Using the Script

./scripts/crawl.sh '<json>' [output_dir]

Examples:

# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When output_dir is provided, each crawled page is saved as a separate markdown file.

Basic Crawl

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'

Focused Crawl with Instructions

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

API Reference

Endpoint

POST https://api.tavily.com/crawl

Headers

| Header | Value |
|---|---|
| Authorization | Bearer <TAVILY_API_KEY> |
| Content-Type | application/json |

Request Body

| Field | Type | Default | Description |
|---|---|---|---|
| url | string | Required | Root URL to begin crawling |
| max_depth | integer | 1 | Levels deep to crawl (1-5) |
| max_breadth | integer | 20 | Links per page |
| limit | integer | 50 | Total pages cap |
| instructions | string | null | Natural language guidance for focus |
| chunks_per_source | integer | 3 | Chunks per page (1-5, requires instructions) |
| extract_depth | string | "basic" | basic or advanced |
| format | string | "markdown" | markdown or text |
| select_paths | array | null | Regex patterns to include |
| exclude_paths | array | null | Regex patterns to exclude |
| allow_external | boolean | true | Include external domain links |
| timeout | float | 150 | Max wait (10-150 seconds) |
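
A pre-flight check for the JSON argument handed to crawl.sh, useful when an agent generates that argument; this is an added example, not code from the skill. The function names are hypothetical, the lower bound of 1 for limit and max_breadth is an assumption, and reporting a missing limit follows the page's own tip to always set one.

```python
import json

# Ranges, enums and field names copied from the Request Body table above.
RANGES = {"max_depth": (1, 5, True), "chunks_per_source": (1, 5, True), "timeout": (10, 150, False)}
ENUMS = {"extract_depth": ("basic", "advanced"), "format": ("markdown", "text")}
FIELDS = {"url", "max_depth", "max_breadth", "limit", "instructions", "chunks_per_source",
          "extract_depth", "format", "select_paths", "exclude_paths", "allow_external",
          "timeout"}

def in_range(value, low, high, integer):
    kinds = (int,) if integer else (int, float)
    return isinstance(value, kinds) and not isinstance(value, bool) and low <= value <= high

def check_crawl_request(raw_json):
    """Return the problems found in a crawl.sh JSON argument; an empty list means none."""
    try:
        body = json.loads(raw_json)
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(body, dict):
        return ["request must be a JSON object"]
    problems = [f"unknown field: {name}" for name in sorted(set(body) - FIELDS)]
    if not isinstance(body.get("url"), str) or not body["url"]:
        problems.append("url is required")
    for name, (low, high, integer) in RANGES.items():
        if name in body and not in_range(body[name], low, high, integer):
            problems.append(f"{name} must be between {low} and {high}")
    for name, allowed in ENUMS.items():
        if name in body and body[name] not in allowed:
            problems.append(f"{name} must be one of: {', '.join(allowed)}")
    for name in ("max_breadth", "limit"):  # assumption: both must be at least 1
        if name in body and not in_range(body[name], 1, float("inf"), True):
            problems.append(f"{name} must be a positive integer")
    for name in ("select_paths", "exclude_paths"):
        value = body.get(name)
        if value is not None and not (isinstance(value, list)
                                      and all(isinstance(p, str) for p in value)):
            problems.append(f"{name} must be a list of regex strings")
    if "chunks_per_source" in body and not body.get("instructions"):
        problems.append("chunks_per_source requires instructions")
    if "limit" not in body:
        problems.append("limit not set: the page's tips say to always set one")
    return problems

if __name__ == "__main__":
    # Constructed request: depth out of range, chunks without instructions, no limit.
    print(check_crawl_request('{"url": "https://docs.example.com", "max_depth": 9, "chunks_per_source": 3}'))
```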

Response Format

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}

Depth vs Performance

| Depth | Typical Pages | Time |
|---|---|---|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with max_depth=1 and increase only if needed.

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit chunks_per_source to get full page content.

Examples

For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

For Context: Targeted Technical Docs

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

For Data Collection: Full Page Archive

Use when saving content to files for later processing:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'

Returns full page content - use the script with output_dir to save as markdown files.

Map API (URL Discovery)

Use map instead of crawl when you only need URLs, not content:

curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'

Returns URLs only (faster than crawl):

{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}

Tips

  • Always use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs
  • Omit chunks_per_source only for data collection - when saving full pages to files
  • Start conservative (max_depth=1, limit=20) and scale up
  • Use path patterns to focus on relevant sections
  • Use Map first to understand site structure before full crawl
  • Always set a limit to prevent runaway crawls
