Markdown.New Crawl Local-First Access
Use markdown.new/crawl for multi-page crawling and async Markdown generation.
Required Behavior
- Prefer local API calls with
curl (or any suitable alternative tools).
- Assume no authentication is required unless the service behavior changes.
- Start with
POST /crawl to get a job ID.
- Poll
GET /crawl/status/{jobId} until crawl completes.
- Default to Markdown output; use
?format=json only when structured output is needed.
- Keep crawl scope explicit (
limit, depth, subdomain/external toggles, include/exclude patterns).
- State default crawl behavior: same-domain links only, up to
500 pages per job.
- If
/crawl fails (network, timeout, blocked host), fall back to another method and state the fallback.
Core Commands
curl -X POST "https://markdown.new/crawl" \
-H "Content-Type: application/json" \
-d '{"url":"https://docs.example.com","limit":50}'
curl "https://markdown.new/crawl/status/<jobId>"
curl "https://markdown.new/crawl/status/<jobId>?format=json"
curl "https://markdown.new/crawl/https://docs.example.com"
curl -X DELETE "https://markdown.new/crawl/status/<jobId>"
API Coverage
POST /crawl: create async crawl job and return job ID.
GET /crawl/status/{jobId}: return crawl output and status.
DELETE /crawl/status/{jobId}: cancel a running crawl; completed pages remain available.
GET /crawl/{url}: browser-style shortcut that starts crawl and returns tracking page.
Common Crawl Options
url (required): crawl starting URL.
limit: max pages, 1-500 (default 500).
depth: max link depth, 1-10 (default 5).
render: enable JS rendering for SPA pages.
source: URL discovery strategy (all, sitemaps, links).
maxAge: max cache age seconds (0-604800, default 86400).
modifiedSince: UNIX timestamp; crawl pages modified after this time.
includeExternalLinks: include cross-domain links.
includeSubdomains: include subdomains.
includePatterns / excludePatterns: wildcard URL filtering.
Output Notes
GET /crawl/status/{jobId} returns concatenated Markdown by default.
- Add
?format=json for per-page structured records.
- Images are stripped by default; add
?retain_images=true to keep them.
- Results are retained for 14 days; expired job IDs return errors.
Operational Limits
- Rate model (as documented in the crawl page FAQ): each crawl costs 50 units against a 500 daily limit (about 10 crawls/day).
- Prefer lower
limit values for targeted extraction to reduce cost and runtime.