Baidu Speech Synthesis

v1.2.3

Baidu Intelligent Cloud Speech Synthesis (TTS), supporting multi-role dialogue audio generation, SSML and segment-merge modes, and speech rate/pitch adjustment.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for guoxh/baidu-speech-synthesis.

Prompt preview (Install & Setup):
Install the skill "Baidu Speech Synthesis" (guoxh/baidu-speech-synthesis) from ClawHub.
Skill page: https://clawhub.ai/guoxh/baidu-speech-synthesis
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required env vars: BAIDU_API_KEY, BAIDU_SECRET_KEY
Required binaries: python3, ffmpeg
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install baidu-speech-synthesis

ClawHub CLI


npx clawhub@latest install baidu-speech-synthesis

Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)

Purpose & Capability
Name/description (Baidu TTS) matches required binaries (python3, ffmpeg), required env vars (BAIDU_API_KEY, BAIDU_SECRET_KEY), and included client/formatter/merger scripts. No unrelated credentials or surprising binaries are requested.
Instruction Scope
SKILL.md and the scripts instruct the agent to read input text files, build SSML, call Baidu token and TTS endpoints, produce temporary audio files and merge them with ffmpeg. These actions are within the stated purpose. Note: some helper scripts (validate_config, diagnose_auth) perform network calls to Baidu endpoints and inspect environment variables (including BAIDU_ACCESS_TOKEN if present); this is expected behavior but worth noting.
Install Mechanism
No remote download/install spec is present (instruction-only install). Dependencies are typical Python libraries and ffmpeg. Minor inconsistency: SKILL.md suggests installing only requests, whereas requirements.txt also lists pydub and python-dotenv; this is not a security issue but is a documentation mismatch to be aware of.
Credentials
Requested environment variables (BAIDU_API_KEY as primary, BAIDU_SECRET_KEY when needed) are proportionate for a Baidu TTS client. The skill supports access_token and IAM key formats as well. One caveat: validate_config enforces specific length/alphanumeric checks for API/Secret that may not match all valid key formats (e.g., bce-v3 IAM keys), causing false failures if using alternate auth methods.
Persistence & Privilege
Skill is not force-included (always: false) and is user-invocable. It allows autonomous invocation (platform default) but does not request elevated or system-wide persistence or credentials for other skills.
Assessment
This skill appears to do what it claims: construct SSML, call Baidu TTS endpoints, and merge audio with ffmpeg. Before installing, consider:

  • Keys you provide (BAIDU_API_KEY / BAIDU_SECRET_KEY, or an access_token or IAM key) will be used to call Baidu endpoints. Keep them secret and prefer least-privilege keys scoped to TTS.
  • validate_config may require both the API Key and Secret Key for its checks and may reject some valid IAM or access-token formats, so it can report false errors if you use an alternative auth method.
  • The skill runs ffmpeg via subprocess and writes temporary files. Do not feed it untrusted input files, since a maliciously crafted input could cause problems.
  • requirements.txt lists pydub and python-dotenv in addition to requests. Install only what you need, and review the code if you plan to run it in sensitive environments.

Overall, the package is internally consistent with its stated purpose.


Runtime requirements

🔊 Clawdis
Bins: python3, ffmpeg
Env: BAIDU_API_KEY, BAIDU_SECRET_KEY
Primary env: BAIDU_API_KEY
Latest: vk975sgavhrm60jc9certrmsdgs83z7d7
Downloads: 174
Stars: 0
Versions: 3
Updated: 4w ago
Version: v1.2.3
License: MIT-0

Baidu Intelligent Cloud Speech Synthesis Skill

Triggers

Use this skill when the user mentions:

  • "Convert this dialogue to audio using Baidu TTS"
  • "Generate male-female dialogue, male voice using Duxiaoyao, female voice using Duxiaomei"
  • "Batch process all dialogues in dialogue.txt"
  • "Adjust speech rate to 7, pitch to 6"
  • "View available voice list"
  • "baidu tts", "dialogue to audio", "multi-speaker speech synthesis"
  • "baidu speech synthesis", "multi-speaker dialogue", "Baidu TTS"

Chinese triggers (for Chinese users):

  • "用百度TTS把这段对话转成音频"
  • "生成男女对话,男声用度逍遥,女声用度小美"
  • "批量处理 dialogue.txt 里的所有对话"
  • "调整语速到7,音调到6"
  • "查看可用的音色列表"

Overview

This skill calls the Baidu Intelligent Cloud Speech Synthesis API to convert text dialogues into audio files with character-specific voices. Multi-speaker dialogues are synthesized segment by segment and merged, because Baidu's SSML supports only one voice per request; SSML mode remains available for single-voice text. The skill also provides a wide choice of voices and speech rate, pitch, and volume adjustment.

Installation Dependencies

# Install Python dependencies
pip install requests

# Ensure ffmpeg is installed (required for audio merging)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: Download from https://ffmpeg.org/download.html

# Optional: If pydub is needed (alternative merging solution)
# pip install pydub

Environment Variables Setup

Choose one of three authentication methods:

Method 1: API Key + Secret Key (auto-token)

export BAIDU_API_KEY="Your API Key (non-bce-v3 format)"
export BAIDU_SECRET_KEY="Your Secret Key"

Method 2: Direct access_token (starts with 1.)

export BAIDU_API_KEY="1.a6b7dbd428f731035f771b8d********"
# BAIDU_SECRET_KEY not required

Method 3: IAM Key (starts with bce-v3/)

export BAIDU_API_KEY="bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P"
# BAIDU_SECRET_KEY not required
# Note: Existing bce-v3/ALTAK-... keys may be dedicated to other services (e.g., search).
# If authentication fails, create a dedicated speech synthesis application to get API Key + Secret Key.

Required Environment Variables

BAIDU_API_KEY must always be set. BAIDU_SECRET_KEY is required only for Method 1 (API Key + Secret Key). Method 2 (direct access_token, starts with 1.) and Method 3 (IAM Key, starts with bce-v3/) do not need it.

The skill scripts automatically detect the key format and choose the corresponding authentication method. If the required variables are not set, the user will be prompted.
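
The prefix-based detection described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function names and the returned labels are hypothetical, and only the two prefixes (1. and bce-v3/) come from this document.

```python
def detect_auth_method(api_key: str) -> str:
    """Classify a BAIDU_API_KEY value by its prefix (illustrative labels)."""
    key = api_key.strip()
    if not key:
        raise ValueError("BAIDU_API_KEY is empty")
    if key.startswith("bce-v3/"):
        return "iam_key"         # Method 3: no secret key needed
    if key.startswith("1."):
        return "access_token"    # Method 2: the value is used as the token
    return "api_key_secret"      # Method 1: BAIDU_SECRET_KEY is also required


def needs_secret_key(api_key: str) -> bool:
    """Return True only when the key format implies Method 1."""
    return detect_auth_method(api_key) == "api_key_secret"
```

A caller would check needs_secret_key(os.environ["BAIDU_API_KEY"]) before deciding whether to ask for BAIDU_SECRET_KEY.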

Usage

1. Direct script invocation (command line)

# Single dialogue file synthesis
python ~/.openclaw/skills/baidu-speech-synthesis/scripts/baidu_tts.py \
    --input dialogue.txt \
    --output conversation.mp3

# Specify voice mapping (character name → voice code)
python scripts/baidu_tts.py \
    --input script.txt \
    --map 小明:1 小红:0 老师:106

# Batch process all .txt files in a directory
python scripts/baidu_tts.py \
    --dir ./dialogues \
    --format mp3

# Adjust speech rate (--spd), pitch (--pit), volume (--vol), and audio encoding (--aue)
python scripts/baidu_tts.py \
    --input text.txt \
    --spd 7 --pit 6 --vol 5 \
    --aue 3
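
The flags above share their names with request parameters of Baidu's text2audio REST interface. As a hedged sketch of how such flags could be turned into a request payload: the helper name, the default cuid, and the 0 to 15 range check are assumptions, and whether baidu_tts.py builds its request this way is not confirmed by this document. Check the parameter names against Baidu's own documentation before relying on them.

```python
def build_tts_params(text, token, spd=5, pit=5, vol=5, per=0, aue=3,
                     cuid="openclaw-demo"):
    """Build a text2audio-style form payload (hypothetical helper)."""
    # Assumed valid range for rate, pitch, and volume; verify with Baidu's docs.
    for name, value in (("spd", spd), ("pit", pit), ("vol", vol)):
        if not 0 <= value <= 15:
            raise ValueError(f"{name} must be between 0 and 15, got {value}")
    return {
        "tex": text,    # text to synthesize
        "tok": token,   # access token
        "cuid": cuid,   # caller-chosen client identifier
        "ctp": 1,       # client type
        "lan": "zh",    # language
        "spd": spd,     # speech rate
        "pit": pit,     # pitch
        "vol": vol,     # volume
        "per": per,     # voice code
        "aue": aue,     # audio encoding
    }
```

Validating the ranges before the network call lets the user see a clear message instead of an API error code.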

2. Usage in OpenClaw sessions

When the user uses one of the trigger phrases above, the skill will:

  1. Check the environment variable configuration
  2. Ask for the input text or file, or identify it automatically
  3. Generate SSML according to the default or specified voice assignment scheme
  4. Call the Baidu API and return the audio file (which can be played automatically or saved)
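
Step 3 depends on knowing which voice each speaker gets. The sketch below assumes an input format of one "speaker: text" line per utterance (ASCII or full-width colon), which this document does not specify, and a fallback voice code of 0 for unknown speakers, which is also an assumption. The default mapping and the NAME:CODE form of --map are taken from this document; the function names are hypothetical.

```python
import re

# Default mapping as documented; FALLBACK_VOICE is an assumption.
DEFAULT_VOICES = {"老王": 3, "张经理": 1, "小李": 4}
FALLBACK_VOICE = 0

# "speaker: text" with either an ASCII or a full-width colon (assumed format).
LINE_RE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")


def parse_map_args(items):
    """Turn --map style items such as '小明:1' into {'小明': 1}."""
    voices = {}
    for item in items:
        name, _, code = item.rpartition(":")
        if not name or not code.isdecimal():
            raise ValueError(f"expected NAME:CODE, got {item!r}")
        voices[name] = int(code)
    return voices


def assign_voices(dialogue, overrides=None):
    """Return (speaker, voice_code, text) for each recognizable line."""
    voices = {**DEFAULT_VOICES, **(overrides or {})}
    segments = []
    for line in dialogue.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and lines without a speaker
        speaker, text = match.groups()
        segments.append((speaker, voices.get(speaker, FALLBACK_VOICE), text))
    return segments
```

Each resulting tuple corresponds to one single-voice request, so the output can feed directly into per-segment synthesis.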

File Structure

baidu-speech-synthesis/
├── SKILL.md                    # This file
├── scripts/
│   ├── baidu_tts.py            # Main API client (token acquisition, SSML requests, segment merging)
│   ├── dialogue_formatter.py   # Dialogue text → SSML conversion and voice mapping
│   └── audio_merger.py         # ffmpeg audio merging tool (segment merge solution)
└── references/
    ├── voice_list.md           # Voice code table, samples, recommended pairings
    ├── ssml_guide.md           # Baidu SSML tags, limitations, examples
    └── api_setup.md            # How to obtain keys, free quota (5 million chars/month), authentication details

Technical Points

  • Intelligent Mode Selection: Detects multi-voice input automatically and defaults to segment synthesis mode, because the Baidu API supports only single-voice SSML.
  • Segment Synthesis Solution: Splits a multi-role dialogue into single-voice segments, synthesizes each segment separately, then merges them with ffmpeg. This works around the API limitation and is compatible with Python 3.13.
  • SSML Single-Voice Support: Supports single-voice SSML (tex_type=3) for complex speech expression by an individual character.
  • Automatic Voice Assignment: The default mapping is "老王" → Duxiaoyao (3), "张经理" → Duxiaoyu (1), "小李" → Duyaya (4); customize it with --map.
  • Error Handling: Friendly prompts for network timeouts, quota exhaustion, audio merge failures, and similar errors.
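
The merge step of the segment synthesis solution can be approximated with ffmpeg's concat demuxer. This is a sketch under assumptions: audio_merger.py may use a different ffmpeg invocation, the function names are hypothetical, and stream copy (-c copy) only works when all segments share the same codec and parameters.

```python
import subprocess
import tempfile
from pathlib import Path


def concat_list_text(paths):
    """Render the list file read by ffmpeg's concat demuxer."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def ffmpeg_concat_command(list_file, output):
    """Build the argv list; no shell is involved, so paths are not re-parsed."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(output)]


def merge_segments(paths, output):
    """Concatenate audio segments in order into a single output file."""
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "segments.txt"
        absolute = [Path(p).resolve() for p in paths]
        list_file.write_text(concat_list_text(absolute), encoding="utf-8")
        subprocess.run(ffmpeg_concat_command(list_file, output),
                       check=True, capture_output=True)
```

Passing the command as a list rather than a shell string keeps file names with spaces or quotes from being interpreted by a shell, which matters when input files are not fully trusted.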

Notes

  • Free Quota: Baidu Speech Synthesis provides a free quota of 5 million characters per month (per the 2026 policy); usage beyond that is pay-as-you-go.
  • Authentication Methods: Three authentication methods are supported (API Key + Secret Key, access_token, IAM Key), and the skill detects which one applies automatically.
  • SSML Limitations: SSML text is limited to 1024 bytes. A Chinese character occupies more than one byte, so the usable character count is lower than 1024; keep each sentence to 120 characters or fewer.
  • Dependencies: The segment merge solution requires ffmpeg (the skill detects a missing install and prompts). pydub is not needed.
  • Voice Expressiveness: Baidu's base voices are relatively flat; improve dialogue expressiveness through the text itself, for example by adding modal particles (语气词) and emotional descriptions.
  • Key Security: Do not hardcode API keys in code; always use environment variables or .env files.
  • Error Handling: Detailed guidance is provided for authentication failures; see references/api_setup.md for help.
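
Because the SSML limit is counted in bytes rather than characters, long text has to be split by encoded size. The sketch below assumes UTF-8 encoding (Baidu may count bytes in a different encoding) and prefers to break at sentence-ending punctuation; the function names are hypothetical and this is not taken from the skill's scripts.

```python
import re

# Zero-width split point after sentence-ending punctuation or a newline.
SENTENCE_END = re.compile(r"(?<=[。！？!?；;\n])")


def _hard_split(sentence, max_bytes, encoding):
    """Split one over-long sentence on character boundaries."""
    piece, size = "", 0
    for ch in sentence:
        n = len(ch.encode(encoding))
        if piece and size + n > max_bytes:
            yield piece
            piece, size = "", 0
        piece += ch
        size += n
    if piece:
        yield piece


def split_by_bytes(text, max_bytes=1024, encoding="utf-8"):
    """Group sentences into chunks whose encoded size fits max_bytes."""
    chunks, current = [], ""
    for sentence in filter(None, SENTENCE_END.split(text)):
        for piece in _hard_split(sentence, max_bytes, encoding):
            if len((current + piece).encode(encoding)) <= max_bytes:
                current += piece
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original text, so nothing is dropped; each chunk can then be sent as its own request.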

Changelog

  • 2026‑03‑31 (v1.2.3): Fixed bare except: statements in audio_merger.py; replaced with proper exception handling to improve debugging and error visibility.
  • 2026‑03‑26 (v1.2.2): Added MIT LICENSE file; updated metadata to declare the ffmpeg dependency, to address ClawHub security warnings.
  • 2026‑03‑26 (v1.2.1): Complete English translation of skill documentation; improved bilingual triggers for both English and Chinese users.
  • 2026‑03‑26 (v1.2): Switched to ffmpeg instead of pydub, solving Python 3.13 compatibility issues; corrected Baidu API limitation description (only supports single-voice SSML); optimized documentation and default voice mapping.
  • 2026‑03‑26 (v1.1): Enhanced authentication support, added IAM Key and direct access_token authentication, updated free quota information, improved error guidance.
  • 2026‑03‑26 (v1.0): Initial release, supporting multi-speaker dialogue synthesis, SSML/segment-merge dual modes.
