Text to Speech

v2.23.0

Generate speech audio from text using HeyGen's Starfish TTS model. Use when: (1) Generating standalone speech audio files from text, (2) Converting text to s...

by Michael Wang (@michaelwang11394)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for michaelwang11394/text-to-speech-heygen.

Install the skill "Text to Speech" (michaelwang11394/text-to-speech-heygen) from ClawHub.
Skill page: https://clawhub.ai/michaelwang11394/text-to-speech-heygen
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required env vars: HEYGEN_API_KEY
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Canonical install target

openclaw skills install michaelwang11394/text-to-speech-heygen

ClawHub CLI

npx clawhub@latest install text-to-speech-heygen

Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)

Purpose & Capability
Name/description match the declared requirement (HEYGEN_API_KEY) and the SKILL.md shows only HeyGen TTS API endpoints and MCP tool calls; no unrelated services, binaries, or config paths are requested.
Instruction Scope
SKILL.md contains only instructions to call HeyGen's /v3/voices and /v3/voices/speech endpoints (or, preferably, the equivalent MCP tools). It does not instruct reading local files, other env vars, or sending data to unexpected endpoints.
Install Mechanism
No install spec and no code files (instruction-only). Nothing is downloaded or written to disk by the skill itself.
Credentials
Requires a single credential (HEYGEN_API_KEY) which is appropriate for accessing the HeyGen API; no additional secrets or unrelated env vars are requested.
Persistence & Privilege
always is false and the skill is user-invocable; it does not request persistent system presence or modify other skills/configs.
Assessment
This skill is coherent for HeyGen TTS usage. Before installing, ensure you trust the skill source and are comfortable providing HEYGEN_API_KEY to it (the key allows calls to your HeyGen account). Limit exposure by using a scoped or expendable API key if HeyGen supports it, rotate keys regularly, and monitor HeyGen account activity for unexpected requests.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Env: HEYGEN_API_KEY
Primary env: HEYGEN_API_KEY
Tags: audio, heygen, latest, speech, starfish, text-to-speech, tts, voice
801 downloads
1 star
11 versions
Updated 1w ago
License: MIT-0

Text-to-Speech (HeyGen Starfish)

Generate speech audio files from text using HeyGen's in-house Starfish TTS model via the v3 API. This skill is for standalone audio generation — separate from video creation.

Authentication

All requests must include the X-Api-Key header. Set the HEYGEN_API_KEY environment variable and pass its value in that header:

curl -X GET "https://api.heygen.com/v3/voices?engine=starfish" \
  -H "X-Api-Key: $HEYGEN_API_KEY"
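
A missing key typically surfaces only as a failed request, so it can help to check for it before calling the API. The following is a minimal sketch rather than part of the skill; the helper name heygen_headers is hypothetical, and it only assumes the header and variable names given above.

```python
import os
from typing import Mapping


def heygen_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Build the auth header, failing early if the API key is missing.

    `heygen_headers` is an illustrative helper, not a HeyGen API.
    """
    api_key = env.get("HEYGEN_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("HEYGEN_API_KEY is not set")
    return {"X-Api-Key": api_key}
```

The returned dictionary can be passed as the headers argument of any HTTP client call.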

Tool Selection

If HeyGen MCP tools are available (mcp__heygen__*), prefer them over direct HTTP API calls.

| Task | MCP Tool | Fallback (Direct API) |
| --- | --- | --- |
| List TTS voices | mcp__heygen__list_audio_voices | GET /v3/voices?engine=starfish |
| Generate speech audio | mcp__heygen__text_to_speech | POST /v3/voices/speech |

Default Workflow

  1. List voices with mcp__heygen__list_audio_voices (or GET /v3/voices?engine=starfish)
  2. Pick a voice matching desired language, gender, and features
  3. Call mcp__heygen__text_to_speech (or POST /v3/voices/speech) with text and voice_id
  4. Use the returned audio_url to download or play the audio
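
Steps 2 and 4 of the workflow above can be sketched in Python as follows. Both function names are hypothetical. pick_voice assumes voice items shaped like the AudioVoiceItem records returned by the voice listing, and save_audio assumes the returned audio_url can be fetched with a plain unauthenticated GET, which this page does not confirm.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def pick_voice(voices: list, language: str, gender: Optional[str] = None) -> Optional[dict]:
    """Return the first voice whose language contains `language` (case-insensitive).

    If `gender` is given, the voice's gender must match it exactly.
    Returns None when nothing matches.
    """
    wanted = language.lower()
    for voice in voices:
        if wanted not in (voice.get("language") or "").lower():
            continue
        if gender and voice.get("gender") != gender:
            continue
        return voice
    return None


def save_audio(audio_url: str, destination: str) -> Path:
    """Download the generated audio to `destination` and return its path.

    Assumption: the URL needs no X-Api-Key header to download.
    """
    path = Path(destination)
    with urllib.request.urlopen(audio_url, timeout=60) as response:
        path.write_bytes(response.read())
    return path
```

For example, pick_voice(voices, "english", gender="male") selects a voice from a listing result, and save_audio(result["audio_url"], "speech.wav") stores the generated file locally.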

List TTS Voices

Retrieve voices compatible with the Starfish TTS model.

Note: This uses the unified GET /v3/voices endpoint with the engine=starfish filter to return only TTS-compatible voices. Not all video voices support Starfish TTS. The response is paginated — use next_token to fetch additional pages.

Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| engine | string | Filter by engine (use starfish for TTS voices) |
| type | string | public or private |
| language | string | Filter by language |
| gender | string | Filter by gender |
| limit | integer | Results per page, 1-100 |
| token | string | Pagination cursor from next_token |
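
To illustrate how these parameters combine, here is a small sketch that builds the query dictionary and enforces the constraints listed in the table. The helper name build_voice_query is hypothetical, and the exact values accepted for language and gender are not specified on this page, so they are passed through unchanged.

```python
from typing import Optional


def build_voice_query(
    language: Optional[str] = None,
    gender: Optional[str] = None,
    voice_type: Optional[str] = None,
    limit: Optional[int] = None,
    token: Optional[str] = None,
) -> dict:
    """Build query parameters for GET /v3/voices, always filtering to Starfish."""
    if voice_type is not None and voice_type not in ("public", "private"):
        raise ValueError("type must be 'public' or 'private'")
    if limit is not None and not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")

    params = {"engine": "starfish"}
    optional = {
        "type": voice_type,
        "language": language,
        "gender": gender,
        "limit": limit,
        "token": token,
    }
    # Only send the filters the caller actually set.
    for name, value in optional.items():
        if value is not None:
            params[name] = str(value)
    return params
```

The result can be passed as params to requests.get, as in the Python listing example in this section.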

curl

curl -X GET "https://api.heygen.com/v3/voices?engine=starfish" \
  -H "X-Api-Key: $HEYGEN_API_KEY"

TypeScript

interface AudioVoiceItem {
  voice_id: string;
  name: string;
  language: string;
  gender: "female" | "male" | "unknown";
  preview_audio_url: string | null;
  support_pause: boolean;
  support_locale: boolean;
  type: string;
}

interface TTSVoicesResponse {
  error: null | string;
  data: AudioVoiceItem[];
  has_more: boolean;
  next_token: string | null;
}

async function listTTSVoices(): Promise<AudioVoiceItem[]> {
  const allVoices: AudioVoiceItem[] = [];
  let token: string | null = null;

  do {
    const url = new URL("https://api.heygen.com/v3/voices");
    url.searchParams.set("engine", "starfish");
    if (token) url.searchParams.set("token", token);

    const response = await fetch(url.toString(), {
      headers: { "X-Api-Key": process.env.HEYGEN_API_KEY! },
    });

    const json: TTSVoicesResponse = await response.json();

    if (json.error) {
      throw new Error(json.error);
    }

    allVoices.push(...json.data);
    token = json.next_token;
  } while (token);

  return allVoices;
}

Python

import requests
import os

def list_tts_voices() -> list:
    """Fetch every Starfish-compatible voice, following pagination cursors."""
    all_voices = []
    token = None

    while True:
        params = {"engine": "starfish"}
        if token:
            params["token"] = token

        response = requests.get(
            "https://api.heygen.com/v3/voices",
            headers={"X-Api-Key": os.environ["HEYGEN_API_KEY"]},
            params=params,
            timeout=30,
        )

        data = response.json()
        if data.get("error"):
            raise Exception(data["error"])

        all_voices.extend(data["data"])

        # Stop on the last page, or if no cursor came back, so the loop
        # cannot keep re-requesting the first page.
        token = data.get("next_token")
        if not data.get("has_more") or not token:
            break

    return all_voices

Response Format

{
  "error": null,
  "data": [
    {
      "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",
      "name": "Chill Brian",
      "language": "English",
      "gender": "male",
      "preview_audio_url": "https://resource.heygen.ai/text_to_speech/WpSDQvmLGXEqXZVZQiVeg6.mp3",
      "support_pause": true,
      "support_locale": false,
      "type": "public"
    }
  ],
  "has_more": false,
  "next_token": null
}

Generate Speech Audio

Convert text to speech audio using a specified voice.

Endpoint

POST https://api.heygen.com/v3/voices/speech

Request Fields

| Field | Type | Req | Description |
| --- | --- | --- | --- |
| text | string | Y | Text content to convert (1-5000 characters) |
| voice_id | string | Y | Voice ID from GET /v3/voices?engine=starfish |
| input_type | string | | "text" (default) or "ssml" for full SSML markup |
| speed | number | | Speech speed, 0.5-2.0 (default: 1.0) |
| language | string | | Base language code (e.g., "en", "pt"). Auto-detected if omitted |
| locale | string | | BCP-47 locale for multilingual voices (e.g., "en-US", "pt-BR") |
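
Checking these limits locally avoids a round trip for requests that would be rejected anyway. The sketch below validates the documented ranges before building the JSON body. The helper name build_tts_payload is hypothetical, and it assumes the 1-5000 limit counts Python string characters (including any SSML markup), which this page does not spell out.

```python
from typing import Optional


def build_tts_payload(
    text: str,
    voice_id: str,
    input_type: str = "text",
    speed: float = 1.0,
    language: Optional[str] = None,
    locale: Optional[str] = None,
) -> dict:
    """Validate the documented field limits and return the request body."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if not voice_id:
        raise ValueError("voice_id is required")
    if input_type not in ("text", "ssml"):
        raise ValueError("input_type must be 'text' or 'ssml'")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")

    payload = {
        "text": text,
        "voice_id": voice_id,
        "input_type": input_type,
        "speed": speed,
    }
    # Optional fields are omitted entirely so the API can auto-detect them.
    if language:
        payload["language"] = language
    if locale:
        payload["locale"] = locale
    return payload
```

The returned dictionary can be sent as the JSON body of POST /v3/voices/speech.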

curl

curl -X POST "https://api.heygen.com/v3/voices/speech" \
  -H "X-Api-Key: $HEYGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to our product demo.",
    "voice_id": "YOUR_VOICE_ID",
    "speed": 1.0
  }'

TypeScript

interface TTSRequest {
  text: string;
  voice_id: string;
  input_type?: "text" | "ssml";
  speed?: number;
  language?: string;
  locale?: string;
}

interface WordTimestamp {
  word: string;
  start: number;
  end: number;
}

interface TTSResponse {
  error: null | string;
  data: {
    audio_url: string;
    duration: number;
    request_id?: string;
    word_timestamps?: WordTimestamp[];
  };
}

async function textToSpeech(request: TTSRequest): Promise<TTSResponse["data"]> {
  const response = await fetch(
    "https://api.heygen.com/v3/voices/speech",
    {
      method: "POST",
      headers: {
        "X-Api-Key": process.env.HEYGEN_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(request),
    }
  );

  const json: TTSResponse = await response.json();

  if (json.error) {
    throw new Error(json.error);
  }

  return json.data;
}

Python

import requests
import os

def text_to_speech(
    text: str,
    voice_id: str,
    input_type: str = "text",
    speed: float = 1.0,
    language: str | None = None,
    locale: str | None = None,
) -> dict:
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speed": speed,
    }

    if input_type != "text":
        payload["input_type"] = input_type

    if language:
        payload["language"] = language

    if locale:
        payload["locale"] = locale

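    # A timeout keeps a stalled connection from hanging the caller.
    response = requests.post(
        "https://api.heygen.com/v3/voices/speech",
        headers={
            "X-Api-Key": os.environ["HEYGEN_API_KEY"],
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=60,
    )

    # Failures are reported in the "error" field of the JSON body.
    data = response.json()
    if data.get("error"):
        raise Exception(data["error"])

    return data["data"]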

Response Format

{
  "error": null,
  "data": {
    "audio_url": "https://resource2.heygen.ai/text_to_speech/.../id=365d46bb.wav",
    "duration": 5.526,
    "request_id": "p38QJ52hfgNlsYKZZmd9",
    "word_timestamps": [
      { "word": "<start>", "start": 0.0, "end": 0.0 },
      { "word": "Hey", "start": 0.079, "end": 0.219 },
      { "word": "there,", "start": 0.239, "end": 0.459 },
      { "word": "<end>", "start": 5.526, "end": 5.526 }
    ]
  }
}

Usage Examples

Basic TTS

const result = await textToSpeech({
  text: "Welcome to our quarterly earnings call.",
  voice_id: "YOUR_VOICE_ID",
});

console.log(`Audio URL: ${result.audio_url}`);
console.log(`Duration: ${result.duration}s`);

With Speed Adjustment

const result = await textToSpeech({
  text: "We're thrilled to announce our newest feature!",
  voice_id: "YOUR_VOICE_ID",
  speed: 1.1,
});

With Language and Locale for Multilingual Voices

const result = await textToSpeech({
  text: "Bem-vindo ao nosso produto.",
  voice_id: "MULTILINGUAL_VOICE_ID",
  language: "pt",
  locale: "pt-BR",
});

With SSML Input

const result = await textToSpeech({
  text: '<speak>Hello <break time="1s"/> and welcome!</speak>',
  voice_id: "YOUR_VOICE_ID",
  input_type: "ssml",
});

Find a Voice and Generate Audio

async function generateSpeech(text: string, language: string): Promise<string> {
  const voices = await listTTSVoices();
  const voice = voices.find(
    (v) => v.language.toLowerCase().includes(language.toLowerCase())
  );

  if (!voice) {
    throw new Error(`No TTS voice found for language: ${language}`);
  }

  const result = await textToSpeech({
    text,
    voice_id: voice.voice_id,
  });

  return result.audio_url;
}

const audioUrl = await generateSpeech("Hello and welcome!", "english");

Pauses with Break Tags

Use SSML-style break tags in your text for pauses:

word <break time="1s"/> word

Rules:

  • Use seconds with s suffix: <break time="1.5s"/>
  • Must have spaces before and after the tag
  • Self-closing tag format
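
The rules above are easy to get wrong when text is assembled programmatically, so a few helpers can keep the tags consistent. This is a sketch with hypothetical names. It reads the spacing rule strictly, so a tag at the very start or end of the text is also flagged.

```python
import re

# Matches the self-closing form with a seconds value, e.g. <break time="1.5s"/>
BREAK_TAG = re.compile(r'<break time="(\d+(?:\.\d+)?)s"/>')


def break_tag(seconds: float) -> str:
    """Return a self-closing break tag with the duration in seconds."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(parts: list, seconds: float) -> str:
    """Join text fragments with a break tag surrounded by single spaces."""
    tag = break_tag(seconds)
    return f" {tag} ".join(part.strip() for part in parts)


def misplaced_breaks(text: str) -> list:
    """Return break tags that lack a space immediately before or after them."""
    problems = []
    for match in BREAK_TAG.finditer(text):
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before != " " or after != " ":
            problems.append(match.group(0))
    return problems
```

join_with_pauses(["Hello", "and welcome!"], 1) yields the same shape as the example above, and misplaced_breaks can be run on hand-written text before sending it.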

With v3, you can also use input_type: "ssml" for full SSML support, allowing richer markup beyond just break tags:

{
  "text": "<speak>Welcome! <break time=\"1s\"/> Let's get started.</speak>",
  "voice_id": "YOUR_VOICE_ID",
  "input_type": "ssml"
}

Best Practices

  1. Use GET /v3/voices?engine=starfish to find compatible voices — the unified /v3/voices endpoint serves all voice types, so the engine=starfish filter is essential for TTS
  2. Check support_locale before setting a locale — only multilingual voices support locale selection
  3. Keep speed between 0.8-1.2 for natural-sounding output
  4. Preview voices using the preview_audio_url before generating (may be null for some voices)
  5. Use word_timestamps in the response for caption syncing or timed text overlays
  6. Use SSML break tags in your text for pauses: word <break time="1s"/> word
  7. Use input_type: "ssml" when you need full SSML markup control beyond simple break tags
  8. Paginate voice listing — the v3 endpoint returns paginated results; use has_more and next_token to fetch all voices
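
Practices 2 and 3 can be applied mechanically before a request is built. The sketch below uses hypothetical helper names: one drops locale for voices whose support_locale flag is false, and the other clamps speed to the 0.8-1.2 range suggested above. The clamping is a stylistic choice, not an API requirement, since the API itself accepts 0.5-2.0.

```python
from typing import Optional


def locale_fields(voice: dict, language: Optional[str] = None, locale: Optional[str] = None) -> dict:
    """Return language/locale request fields, omitting locale if unsupported."""
    fields = {}
    if language:
        fields["language"] = language
    # Only multilingual voices (support_locale == True) accept a locale.
    if locale and voice.get("support_locale"):
        fields["locale"] = locale
    return fields


def natural_speed(speed: float, low: float = 0.8, high: float = 1.2) -> float:
    """Clamp speed into the range recommended for natural-sounding output."""
    return max(low, min(high, speed))
```

The dictionary from locale_fields can be merged into the request body alongside text and voice_id.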
