{"skill":{"slug":"elevenlabs-toolkit","displayName":"Elevenlabs Toolkit","summary":"ElevenLabs voice API integration — TTS, sound effects, music generation, speech-to-text, voice isolation, and streaming. Use when building voice-enabled apps...","description":"---\nname: elevenlabs-toolkit\ndescription: ElevenLabs voice API integration — TTS, sound effects, music generation, speech-to-text, voice isolation, and streaming. Use when building voice-enabled apps, generating narration, creating audio content, or transcribing speech. Requires ELEVENLABS_API_KEY.\nversion: 1.0.2\nmetadata:\n  {\n      \"openclaw\": {\n            \"emoji\": \"\\ud83c\\udf99\\ufe0f\",\n            \"requires\": {\n                  \"bins\": [],\n                  \"env\": [\n                        \"ELEVENLABS_API_KEY\"\n                  ]\n            },\n            \"primaryEnv\": \"ELEVENLABS_API_KEY\",\n            \"network\": {\n                  \"outbound\": true,\n                  \"reason\": \"Calls ElevenLabs API (api.elevenlabs.io) for TTS, SFX, music generation, STT, and voice operations.\"\n            },\n            \"security_notes\": \"base64 used for encoding audio binary responses from ElevenLabs API. UploadFile is FastAPI's multipart type for audio input to STT endpoint. 
'system prompt' refers to ElevenLabs agent system prompt configuration field — not a prompt injection vector.\"\n      }\n}\n---\n\n# ElevenLabs Toolkit\n\nProgrammatic access to all 7 ElevenLabs API capabilities via FastAPI endpoints or standalone Python functions.\n\n---\n\n## When to Use This / When NOT to Use This\n\n**Use ElevenLabs when:**\n- Generating high-quality narration audio for videos, demos, or content (especially with Rachel or a consistent character voice)\n- Building a voice-enabled app that needs streamed speech in real-time\n- Transcribing audio files (STT/Scribe)\n- Generating ambient sound effects or background music from text descriptions\n- Isolating clean voice from a noisy recording\n\n**Do NOT use ElevenLabs when:**\n- You need fast/cheap TTS with no quality bar — use **local TTS instead** (see below)\n- You're offline or the API key isn't available\n- You're generating large volumes of test audio and don't want to burn character quota\n\n### ElevenLabs vs Local TTS (kokoro / chatterbox)\n\n| Criteria | ElevenLabs | Local TTS (kokoro/chatterbox) |\n|---|---|---|\n| Voice quality | ★★★★★ — natural, expressive | ★★★ — good but robotic edges |\n| Cost | Chars deducted from monthly quota | Free, unlimited |\n| Latency | ~300–800ms API round-trip | ~50–200ms local inference |\n| Voice consistency | Named voices (Rachel etc.) persist | Model-dependent |\n| Offline use | ❌ Requires internet + API key | ✅ Fully local |\n| Best for | Final narration, published content | Drafts, testing, high-volume batch |\n\n**Rule of thumb:** Use ElevenLabs for anything that will be seen/heard by a user. 
Use local TTS for drafts, tests, and volume work.\n\n---\n\n## Capabilities\n\n| Tool | Endpoint | What It Does |\n|---|---|---|\n| Voices | GET /api/voices | Browse available voices with metadata |\n| TTS | POST /api/voice/tts | Batch text-to-speech (any voice, any language) |\n| TTS Stream | WS /api/voice/stream | Real-time WebSocket TTS streaming |\n| Sound Effects | POST /api/voice/sfx | Generate ambient audio from text prompts |\n| Music | POST /api/voice/music | Generate background music from descriptions |\n| STT (Scribe) | POST /api/voice/stt | Transcribe audio with language detection |\n| Voice Isolation | POST /api/voice/isolate | Extract clean voice from noisy audio |\n\n---\n\n## Known Voice IDs\n\nThese are confirmed voices used in OpenClaw workflows. Always prefer these over browsing the full list:\n\n| Voice | Voice ID | Best For |\n|---|---|---|\n| **Rachel** | `21m00Tcm4TlvDq8ikWAM` | Default narration — clear, warm, American English |\n| Adam | `pNInz6obpgDQGcFmaJgB` | Male narration, authoritative tone |\n| Domi | `AZnzlk1XvdvUeBnXmlld` | Energetic, conversational |\n| Bella | `EXAVITQu4vr4xnSDxMaL` | Soft, gentle narration |\n\n> **Default for all narration tasks:** Use Rachel (`21m00Tcm4TlvDq8ikWAM`) unless explicitly specified otherwise.\n\nTo get the full current list from the API:\n```bash\ncurl -s -H \"xi-api-key: $ELEVENLABS_API_KEY\" https://api.elevenlabs.io/v1/voices | python3 -m json.tool\n```\n\n---\n\n## Quick Start\n\n```python\nimport os\n\nimport httpx\n\nBASE = \"http://localhost:8000\"  # Your FastAPI app\nKEY = os.environ[\"ELEVENLABS_API_KEY\"]\n\n# Get voices\nvoices = httpx.get(f\"{BASE}/api/voices\").json()\n\n# Generate speech\naudio = httpx.post(f\"{BASE}/api/voice/tts\", json={\n    \"text\": \"Hello world\",\n    \"voice_id\": voices[0][\"voice_id\"],\n    \"model_id\": \"eleven_multilingual_v2\"\n}).content  # Returns raw audio bytes\n\n# Generate sound effects\nsfx = httpx.post(f\"{BASE}/api/voice/sfx\", json={\n    \"prompt\": 
\"ocean waves on a quiet beach at night\"\n}).content\n```\n\n---\n\n## Audio Output Format\n\n**TTS and SFX endpoints return raw audio bytes** (not base64, not JSON).\n\n```python\n# Correct: .content gives you bytes\naudio_bytes = response.content  # type: bytes\n\n# Save to file\nwith open(\"output.mp3\", \"wb\") as f:\n    f.write(audio_bytes)\n\n# The file format is MP3 by default\n# File size estimate: ~1 MB per minute of speech at standard quality\n```\n\n**What you get back from each endpoint:**\n\n| Endpoint | Response type | How to handle |\n|---|---|---|\n| POST /api/voice/tts | `bytes` (MP3) | Write directly to `.mp3` file |\n| POST /api/voice/sfx | `bytes` (MP3) | Write directly to `.mp3` file |\n| POST /api/voice/music | `bytes` (MP3) | Write directly to `.mp3` file |\n| POST /api/voice/stt | `JSON` | `{\"text\": \"transcription...\", \"language\": \"en\"}` |\n| POST /api/voice/isolate | `bytes` (MP3) | Write directly to `.mp3` file |\n| GET /api/voices | `JSON` | List of `{voice_id, name, labels, ...}` |\n\n---\n\n## Voice Selection Guide\n\n- **English only:** Use `eleven_turbo_v2_5` — faster, no accent bleed\n- **Multilingual:** Use `eleven_multilingual_v2` — supports 29 languages\n- **Accent warning:** Multilingual model can bleed accents across languages. If an English voice sounds Japanese, switch to turbo.\n\n---\n\n## Quota Management\n\nElevenLabs charges per character for TTS. Key patterns:\n- Cache aggressively — identical text + voice = identical audio\n- Use `prompt-cache` skill for SHA-256 dedup before calling TTS\n- A 6-scene children's story ≈ 2,000 characters\n- Free tier: 10k chars/month. Starter: 30k. Creator: 100k.\n\n---\n\n## Integration\n\nCopy `scripts/elevenlabs_api.py` into your FastAPI app and mount the router:\n\n```python\nfrom elevenlabs_api import router\napp.include_router(router)\n```\n\nSet `ELEVENLABS_API_KEY` in your environment. 
All endpoints handle errors gracefully with proper HTTP status codes.\n\n---\n\n## What If the FastAPI Server Isn't Running?\n\nThe Quick Start examples assume `http://localhost:8000` is live. If it's not:\n\n```python\n# Check if server is up before calling\nimport httpx\n\ntry:\n    httpx.get(\"http://localhost:8000/health\", timeout=2.0)\nexcept httpx.ConnectError:\n    # Server is not running — start it first\n    import subprocess\n    subprocess.Popen([\"uvicorn\", \"elevenlabs_api:app\", \"--port\", \"8000\"])\n    import time; time.sleep(2)  # Give it a moment to bind\n```\n\nOr call the ElevenLabs API directly without the FastAPI wrapper — the `scripts/elevenlabs_api.py` functions are importable standalone:\n\n```python\nfrom elevenlabs_api import generate_tts  # if the module exposes standalone functions\n```\n\n---\n\n## Error Handling: API Key and Rate Limits\n\n**Missing API key:**\n```\nhttpx.HTTPStatusError: 401 Unauthorized\n{\"detail\": {\"status\": \"unauthorized\", \"message\": \"Invalid API key\"}}\n```\n→ Check `ELEVENLABS_API_KEY` is set: `echo $ELEVENLABS_API_KEY`\n→ Retrieve from 1Password: `op read \"op://OpenClaw/ElevenLabs API Credentials/credential\"`\n\n**Rate limited (429):**\n```json\n{\"detail\": {\"status\": \"too_many_requests\", \"message\": \"Too many requests\"}}\n```\n→ Wait and retry with exponential backoff. ElevenLabs rate limits are per-minute on the free/starter tiers.\n→ On Creator tier and above, limits are much higher — check your tier in the ElevenLabs dashboard.\n\n**Quota exhausted:**\n```json\n{\"detail\": {\"status\": \"quota_exceeded\", \"message\": \"Quota exceeded\"}}\n```\n→ Character quota for the month is used up. Either wait for monthly reset or upgrade tier.\n→ Check current usage: `curl -s -H \"xi-api-key: $KEY\" https://api.elevenlabs.io/v1/user/subscription`\n\n---\n\n## Files\n\n- `scripts/elevenlabs_api.py` — FastAPI router with all 7 endpoints\n\n---\n\n## Common Mistakes\n\n1. 
**Treating the response as JSON when it's bytes**\n   - ❌ `response.json()` on a TTS call → `JSONDecodeError`\n   - ✅ `response.content` → raw bytes, then write to `.mp3`\n\n2. **Using the wrong voice ID**\n   - ElevenLabs voice IDs are opaque strings, not names\n   - ❌ `\"voice_id\": \"Rachel\"` → 404 or wrong voice\n   - ✅ `\"voice_id\": \"21m00Tcm4TlvDq8ikWAM\"` (Rachel's actual ID)\n\n3. **Calling TTS for large batches without caching**\n   - Identical text+voice always produces identical audio — don't re-generate what's already cached\n   - Burns character quota unnecessarily\n\n4. **Using multilingual model for English-only content**\n   - `eleven_multilingual_v2` is slower and can produce accent artifacts on English-only text\n   - Use `eleven_turbo_v2_5` for English-only work\n\n5. **Not checking the FastAPI server is running before calling**\n   - `httpx.ConnectError` is confusing if you forget the local server dependency\n   - Add a health check or start-server step before calling endpoints\n\n---\n\n## Security Notes\n\nThis skill uses patterns that may trigger automated security scanners:\n- **base64**: Used for encoding audio/binary data in API responses (standard practice for media APIs)\n- **UploadFile**: FastAPI's built-in file upload parameter for STT/voice isolation endpoints\n- **\"system prompt\"**: Refers to configuring agent instructions, not prompt injection\n","topics":["Api Integration","Music Generation","Speech-to-Text","Audio"],"tags":{"latest":"1.0.2"},"stats":{"comments":0,"downloads":921,"installsAllTime":35,"installsCurrent":6,"stars":0,"versions":3},"createdAt":1772393345602,"updatedAt":1778491681414},"latestVersion":{"version":"1.0.2","createdAt":1774672832091,"changelog":"Add security_notes explaining base64 audio encoding, UploadFile type, and system prompt config 
context","license":"MIT-0"},"metadata":{"setup":[{"key":"ELEVENLABS_API_KEY","required":true}],"os":null,"systems":null},"owner":{"handle":"nissan","userId":"s17f2fw07zktjmcgagf5c29tbd83rt7v","displayName":"Nissan Dookeran","image":"https://avatars.githubusercontent.com/u/12583?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780090162903}}