ElevenLabs TTS

ElevenLabs TTS (Text-to-Speech) with emotional audio tags for expressive voice synthesis. WhatsApp-compatible voice messages with Opus conversion. Supports 7...

MIT-0 · Free to use, modify, and redistribute. No attribution required.

byshaharsh@Shaharsha

Security Scan

VirusTotal: Suspicious

OpenClaw: Suspicious (medium confidence)

ℹ

Purpose & Capability

The skill's stated purpose (ElevenLabs TTS with audio tags and WhatsApp-compatible Opus conversion) aligns with the only declared secret (ELEVENLABS_API_KEY) and with ffmpeg use for format conversion. However, the registry metadata at the top claims 'Required binaries: none' while the SKILL.md metadata and runtime instructions require ffmpeg on PATH — an internal inconsistency. Also the package has no homepage/source URL and an opaque owner id, which reduces trust in provenance.

ℹ

Instruction Scope

The instructions stay within expected scope for a TTS integration: they describe using the ElevenLabs API, configuring an openclaw.json entry, and running ffmpeg to convert formats. The allowed-tools list includes exec (needed for ffmpeg) which is reasonable for audio conversion but grants arbitrary command execution — expected here but worth noting. The SKILL.md contains a detected 'unicode-control-chars' pattern (see scan findings) that can be used to hide or obfuscate content; that is out-of-band for a normal TTS doc and should be inspected.

✓

Install Mechanism

This is an instruction-only skill with no install spec and no code files, so nothing will be downloaded or written during an install step. That lowers install-time risk.

✓

Credentials

The skill requests a single API key (ELEVENLABS_API_KEY) as its primary credential, which is proportional to a service integration. No unrelated secrets or multiple credentials are requested. The instructions do recommend storing the key in openclaw.json — acceptable but note the usual risks of storing secrets in config files.

✓

Persistence & Privilege

The skill is not always-enabled and is user-invocable; it does not request broader system persistence or modification of other skills. Autonomous invocation is allowed by default (disable-model-invocation is false) but that is the normal platform default and not by itself a red flag.

Scan Findings in Context

[unicode-control-chars] unexpected: Hidden/Unicode control characters were detected in SKILL.md. These characters are not necessary for a TTS usage guide and are commonly used to obfuscate or alter how text is parsed by LLMs or UIs. Inspect the raw SKILL.md for invisible characters before trusting it.

What to consider before installing

This skill appears to do what it says (ElevenLabs TTS + ffmpeg conversion) and only needs an ElevenLabs API key, but there are a few things to check before installing:

1) Verify provenance: the skill has no homepage/source and an opaque owner id; prefer skills from known authors.
2) Inspect the raw SKILL.md file for hidden Unicode control characters (the scanner flagged them); these can hide unexpected instructions or influence parsing.
3) Confirm ffmpeg is installed and intended: the registry metadata omitted it, but the doc requires ffmpeg for Opus conversion; ensure you want exec permission so ffmpeg can be invoked.
4) Store your ELEVENLABS_API_KEY securely (avoid committing it to public repos); if possible create a limited-scope API key.
5) If anything in the SKILL.md still looks obfuscated or if you cannot confirm the source, do not install.

If you want, provide the raw SKILL.md (displaying invisible chars) or the skill package source and I can re-check more precisely.

Current version: v2.2.0

Tags: ai-voice, audio, elevenlabs, elevenlabs-tts, hebrew, latest, multilingual, nikud, openclaw, podcast, singing, speech, text-to-speech, tts, voice, whatsapp

License

MIT-0 (terms: https://spdx.org/licenses/MIT-0.html)

Runtime requirements

🎙️ Clawdis

Env: ELEVENLABS_API_KEY

Primary env: ELEVENLABS_API_KEY

SKILL.md

ElevenLabs TTS (Text-to-Speech)

Generate expressive voice messages using ElevenLabs v3 with audio tags.

Prerequisites

ElevenLabs API Key (ELEVENLABS_API_KEY): Required. Get one at elevenlabs.io → Profile → API Keys. Configure in openclaw.json under messages.tts.elevenlabs.apiKey.
ffmpeg: Required for audio format conversion (MP3 → Opus for WhatsApp compatibility). Must be installed and available on PATH.
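
A quick preflight check for the two prerequisites above might look like the following sketch. The helper name `missing_prerequisites` is hypothetical and not part of the skill, and it only looks at the environment variable, not at the `openclaw.json` entry.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The skill declares ELEVENLABS_API_KEY as its only secret.
    if not env.get("ELEVENLABS_API_KEY"):
        problems.append("ELEVENLABS_API_KEY is not set")
    # ffmpeg must be resolvable on PATH for the MP3 -> Opus step.
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems
```

Calling `missing_prerequisites()` with no arguments checks the current process environment; the `env` and `which` parameters exist only to make the check easy to test.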

Quick Start Examples

Storytelling (emotional journey):

[soft] It started like any other day... [pause] But something felt different. [nervous] My hands were shaking as I opened the envelope. [gasps] I got in! [excited] I actually got in! [laughs] [happy] This changes everything!

Horror/Suspense (building dread):

[whispers] The house has been empty for years... [pause] At least, that's what they told me. [nervous] But I keep hearing footsteps. [scared] They're getting closer. [gasps] [panicking] The door— it's opening by itself!

Conversation with reactions:

[curious] So what happened at the meeting? [pause] [surprised] Wait, they fired him?! [gasps] [sad] That's terrible... [sighs] He had a family. [thoughtful] I wonder what he'll do now.

Hebrew (romantic moment):

[soft] היא עמדה שם, מול השקיעה... [pause] הלב שלי פעם כל כך חזק. [nervous] לא ידעתי מה להגיד. [hesitates] אני... [breathes] [tender] את יודעת שאני אוהב אותך, נכון?

Spanish (celebration to reflection):

[excited] ¡Lo logramos! [laughs] [happy] No puedo creerlo... [pause] [thoughtful] Fueron tantos años de trabajo. [emotional] [soft] Gracias a todos los que creyeron en mí. [sighs] [content] Valió la pena cada momento.

Configuration (OpenClaw)

In openclaw.json, configure TTS under messages.tts:

{
  "messages": {
    "tts": {
      "provider": "elevenlabs",
      "elevenlabs": {
        "apiKey": "sk_your_api_key_here",
        "voiceId": "pNInz6obpgDQGcFmaJgB",
        "modelId": "eleven_v3",
        "languageCode": "en",
        "voiceSettings": {
          "stability": 0.5,
          "similarityBoost": 0.75,
          "style": 0,
          "useSpeakerBoost": true,
          "speed": 1
        }
      }
    }
  }
}
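
To sanity-check a config like the one above before use, a small reader along these lines may help. This is only a sketch: the function name `elevenlabs_settings` is hypothetical, and treating `apiKey`, `voiceId`, and `modelId` as required is an assumption, not something OpenClaw is known to enforce.

```python
import json


def elevenlabs_settings(config):
    """Return the messages.tts.elevenlabs block, or raise ValueError."""
    tts = config.get("messages", {}).get("tts", {})
    if tts.get("provider") != "elevenlabs":
        raise ValueError("messages.tts.provider is not 'elevenlabs'")
    block = tts.get("elevenlabs", {})
    # Assumption: these three keys are the minimum needed for a request.
    missing = [key for key in ("apiKey", "voiceId", "modelId") if not block.get(key)]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return block


def load_settings(path="openclaw.json"):
    """Read the JSON file at `path` and extract the ElevenLabs block."""
    with open(path, encoding="utf-8") as handle:
        return elevenlabs_settings(json.load(handle))
```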

Getting your API Key:

Go to https://elevenlabs.io
Sign up/login
Click profile → API Keys
Copy your key

Recommended Voices for v3

These premade voices are optimized for v3 and work well with audio tags:

Voice	ID	Gender	Accent	Best For
Adam	`pNInz6obpgDQGcFmaJgB`	Male	American	Deep narration, general use
Rachel	`21m00Tcm4TlvDq8ikWAM`	Female	American	Calm narration, conversational
Brian	`nPczCjzI2devNBz1zQrb`	Male	American	Deep narration, podcasts
Charlotte	`XB0fDUnXU5powFXDhCwa`	Female	English-Swedish	Expressive, video games
George	`JBFqnCBsd6RMkjVDRZzb`	Male	British	Raspy narration, storytelling

Finding more voices:

Browse: https://elevenlabs.io/voice-library
v3-optimized collection: https://elevenlabs.io/app/voice-library/collections/aF6JALq9R6tXwCczjhKH
API: GET https://api.elevenlabs.io/v1/voices
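
The API route can be called with the standard library alone. The sketch below assumes the key is sent in the `xi-api-key` header and that the response is a JSON object with a `voices` list whose items carry `name` and `voice_id`; check both against the current ElevenLabs API reference. All function names are hypothetical.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key):
    """Build the GET request; the key travels in the xi-api-key header."""
    return urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})


def voice_ids(payload):
    """Map voice name -> voice_id from a decoded /v1/voices response."""
    return {voice["name"]: voice["voice_id"] for voice in payload.get("voices", [])}


def fetch_voice_ids(api_key):
    """Perform the network call and return the name -> id mapping."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=30) as response:
        return voice_ids(json.load(response))
```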

Voice selection tips:

Use IVC (Instant Voice Clone) or premade voices; PVC (Professional Voice Clone) voices are not yet optimized for v3
Match voice character to your use case (whispering voice won't shout well)
For expressive IVCs, include varied emotional tones in training samples

Model Settings

Model: eleven_v3 (alpha) - ONLY model supporting audio tags
Languages: 70+ supported with full audio tag control

Stability Modes

Mode	Stability	Description
Creative	0.3-0.5	More emotional/expressive, may hallucinate
Natural	0.5-0.7	Balanced, closest to original voice
Robust	0.7-1.0	Highly stable, less responsive to tags

For audio tags, use Creative (0.5) or Natural. Higher stability reduces tag responsiveness.

Speed Control

Range: 0.7 (slow) to 1.2 (fast), default 1.0

Extreme values affect quality. For pacing, prefer audio tags like [rushed] or [drawn out].
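
The stability and speed guidance above can be turned into a simple range check on the `voiceSettings` block. This is a minimal sketch: `check_voice_settings` is a hypothetical helper, the lower stability bound of 0.0 is an assumption (the table only starts at 0.3), and the 0.7 threshold simply mirrors the Robust row.

```python
def check_voice_settings(settings):
    """Return warnings for values outside the ranges described above."""
    warnings = []
    stability = settings.get("stability")
    if stability is not None:
        if not 0.0 <= stability <= 1.0:  # lower bound is an assumption
            warnings.append("stability should be between 0.0 and 1.0")
        elif stability > 0.7:
            warnings.append("stability above 0.7 (Robust) dampens audio tags")
    speed = settings.get("speed")
    if speed is not None and not 0.7 <= speed <= 1.2:
        warnings.append("speed should be between 0.7 and 1.2")
    return warnings
```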

Critical Rules

Length Limits

Optimal: <800 characters per segment (best quality)
Maximum: 10,000 characters (API hard limit)
Quality degrades with longer text - voice becomes inconsistent
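
One way to stay under the 800-character guideline is to pack whole sentences greedily into segments. The sketch below is an illustration, not the skill's own method: `split_segments` is a hypothetical name, sentence boundaries are approximated by `.`, `!`, or `?` followed by whitespace, and a single sentence longer than the limit is kept whole rather than cut. It also does not carry the active audio tag over into the next segment.

```python
import re


def split_segments(text, limit=800):
    """Greedily pack whole sentences into segments shorter than `limit`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) >= limit:
            # Adding this sentence would reach the limit: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```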

Audio Tags - Best Practices for Natural Sound

How many tags to use:

1-2 tags per sentence or phrase (not more!)
Tags persist until the next tag - no need to repeat
Overusing tags sounds unnatural and robotic

Where to place tags:

At emotional transition points
Before key dramatic moments
When energy/pace changes

Context matters:

Write text that matches the tag emotion
Longer text with context = better interpretation
Example: [nervous] I... I'm not sure about this. What if it doesn't work? works better than [nervous] Hello.

Combine tags for nuance:

[nervously][whispers] = nervous whispering
[excited][laughs] = excited laughter
Keep combinations to 2 tags max

Regenerate for best results:

v3 is non-deterministic - same text = different outputs
Generate 3+ versions, pick the best
Small text tweaks can improve results

Match tag to voice:

Don't use [shouts] on a whispering voice
Don't use [whispers] on a loud/energetic voice
Test tags with your chosen voice
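
The "1-2 tags per sentence" guideline above lends itself to a small lint pass. This sketch assumes a tag is any bracketed run such as `[whispers]`, and that sentences end at `.`, `!`, or `?` followed by whitespace; `overtagged_sentences` is a hypothetical helper.

```python
import re

# Any bracketed run without nested brackets counts as one audio tag.
TAG = re.compile(r"\[[^\[\]]+\]")


def overtagged_sentences(text, max_tags=2):
    """Return the sentences that carry more than `max_tags` audio tags."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(TAG.findall(s)) > max_tags]
```

A tag that stands between two sentences is counted with the sentence that follows it, which matches the idea that a tag applies to the text after it.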

SSML Not Supported

v3 does NOT support SSML break tags. Use audio tags and punctuation instead.

Punctuation Effects (use with tags!)

Punctuation enhances audio tags:

Ellipses (...) → dramatic pauses: [nervous] I... I don't know...
CAPS → emphasis: [excited] That's AMAZING!
Dashes (—) → interruptions: [explaining] So what you do is— [interrupting] Wait!
Question marks → uncertainty: [nervous] Are you sure about this?
Exclamation marks (!) → energy boost: [happy] We did it!

Combine tags + punctuation for maximum effect:

[tired] It was a long day... [sighs] Nobody listens anymore.

WhatsApp Voice Messages

Complete Workflow

Generate with tts tool (returns MP3)
Convert to Opus (required for Android!)
Send with message tool

Step-by-Step

1. Generate TTS (add [pause] at end to prevent cutoff):

tts text="[excited] This is amazing! [pause]" channel=whatsapp

Returns: MEDIA:/tmp/tts-xxx/voice-123.mp3

2. Convert MP3 → Opus:

ffmpeg -i /tmp/tts-xxx/voice-123.mp3 -c:a libopus -b:a 64k -vbr on -application voip /tmp/tts-xxx/voice-123.ogg

3. Send the Opus file:

Note: The message field below contains a Unicode Left-to-Right Mark (U+200E) between the quotes. This is intentional — WhatsApp requires a non-empty message body to send voice notes. The LTR mark is invisible but satisfies this requirement without displaying any text.

message action=send channel=whatsapp target="+972..." filePath="/tmp/tts-xxx/voice-123.ogg" asVoice=true message="‎"
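
If the conversion in step 2 is scripted rather than typed, it could be wrapped as in the sketch below. The function names are hypothetical, and the `-y` flag (overwrite an existing output file) is an addition not present in the command above. Passing the arguments as a list avoids shell quoting problems with unusual file names.

```python
import subprocess
from pathlib import Path


def opus_command(mp3_path):
    """Build the ffmpeg argv for the MP3 -> Opus step, plus the output path."""
    ogg_path = Path(mp3_path).with_suffix(".ogg")
    argv = [
        "ffmpeg", "-y", "-i", str(mp3_path),
        "-c:a", "libopus", "-b:a", "64k", "-vbr", "on", "-application", "voip",
        str(ogg_path),
    ]
    return argv, ogg_path


def convert_to_opus(mp3_path):
    """Run ffmpeg and return the .ogg path; raises CalledProcessError on failure."""
    argv, ogg_path = opus_command(mp3_path)
    subprocess.run(argv, check=True)
    return ogg_path
```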

Why Opus?

Format	iOS	Android	Transcribe
MP3	✅ Works	❌ May fail	❌ No
Opus (.ogg)	✅ Works	✅ Works	✅ Yes

Always convert to Opus - it's the only format that:

Works on all devices (iOS + Android)
Supports WhatsApp's transcribe button

Audio Cutoff Fix

ElevenLabs sometimes cuts off the last word. Always add [pause] or ... at the end:

[excited] This is amazing! [pause]
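
To apply this fix consistently, the text can be passed through a small guard before generation. A minimal sketch, with `with_trailing_pause` as a hypothetical helper name:

```python
def with_trailing_pause(text):
    """Append [pause] unless the text already ends with [pause] or an ellipsis."""
    stripped = text.rstrip()
    if not stripped:
        return stripped  # nothing to speak, nothing to pad
    if stripped.endswith(("[pause]", "...", "…")):
        return stripped
    return stripped + " [pause]"
```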

Long-Form Audio (Podcasts)

For content >800 chars:

Split into short segments (<800 chars each)
Generate each with tts tool

Concatenate with ffmpeg:

cat > list.txt << EOF
file '/path/file1.mp3'
file '/path/file2.mp3'
EOF
ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3

Convert to Opus for WhatsApp
Send as single voice message

Important: Don't mention "part 2" or "chapter" - keep it seamless.
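
The concatenation step above can also be driven from a script. The sketch below writes the same `list.txt` format and runs the same concat command; the function names are hypothetical, `-y` (overwrite output) is an addition, and the `'\''` escaping of single quotes follows ffmpeg's quoting rules as I understand them, so test it with your own file names.

```python
import subprocess
from pathlib import Path


def concat_list(paths):
    """Render the concat demuxer list, one quoted `file` line per input."""
    lines = []
    for path in paths:
        # Close the quote, emit an escaped quote, reopen the quote.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concatenate(paths, output="final.mp3", list_path="list.txt"):
    """Write the list file and stream-copy all inputs into `output`."""
    Path(list_path).write_text(concat_list(paths), encoding="utf-8")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output],
        check=True,
    )
    return output
```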

Multi-Speaker Dialogue

v3 can handle multiple characters in one generation:

Jessica: [whispers] Did you hear that?
Chris: [interrupting] —I heard it too!
Jessica: [panicking] We need to hide!

Dialogue tags: [interrupting], [overlapping], [cuts in], [interjecting]

Audio Tags Quick Reference

Category	Tags	When to Use
Emotions	[excited], [happy], [sad], [angry], [nervous], [curious]	Main emotional state - use 1 per section
Delivery	[whispers], [shouts], [soft], [rushed], [drawn out]	Volume/speed changes
Reactions	[laughs], [sighs], [gasps], [clears throat], [gulps]	Natural human moments - sprinkle sparingly
Pacing	[pause], [hesitates], [stammers], [breathes]	Dramatic timing
Character	[French accent], [British accent], [robotic tone]	Character voice shifts
Dialogue	[interrupting], [overlapping], [cuts in]	Multi-speaker conversations

Most effective tags (reliable results):

Emotions: [excited], [nervous], [sad], [happy]
Reactions: [laughs], [sighs], [whispers]
Pacing: [pause]

Less reliable (test and regenerate):

Sound effects: [explosion], [gunshot]
Accents: results vary by voice

Full tag list: See references/audio-tags.md

Troubleshooting

Tags read aloud?

Verify using eleven_v3 model
Use IVC/premade voices, not PVC
Simplify tags (no "tone" suffix)
Increase text length (250+ chars)

Voice inconsistent?

Segment is too long - split at <800 chars
Regenerate (v3 is non-deterministic)
Try a higher stability setting (lower values trade consistency for expressiveness)

WhatsApp won't play?

Convert to Opus format (see above)

No emotion despite tags?

Voice may not match tag style
Try Creative stability mode (0.5)
Add more context around the tag
