Install

```shell
openclaw skills install alibabacloud-bailian-voice-creator
```

Professional-grade AI voice creation skill supporting speech recognition (ASR) and text-to-speech (TTS). Built on the Alibaba Cloud DashScope API, using qwen3-asr-flash-filetrans, qwen-tts, and other models.
Hardcoded keys such as `api_key = "sk-..."` are strictly forbidden. Obtain the key through `scripts/api_key.py`'s `get_api_key()` function, or via `os.environ.get('DASHSCOPE_API_KEY')`. Speech synthesis must use `dashscope.MultiModalConversation.call` with the `qwen-tts` model. Using edge-tts, gTTS, ElevenLabs, Azure TTS, sambert, NLS, or any other third-party TTS service is strictly forbidden. If the `dashscope` library is missing, install it first with `pip install dashscope`.

```python
import dashscope
from api_key import get_api_key

api_key = get_api_key()
if api_key:
    dashscope.api_key = api_key
# If get_api_key() returns None, the SDK resolves auth via environment (AK/SK, etc.)

response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry"
)
audio_url = response.output.get('audio', {}).get('url', '')
```
```python
response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry",
    # NOTE: instructions value must be in Chinese - the qwen-tts model processes Chinese instructions
    instructions="语速快,充满热情和感染力,直播带货风格"
)
```
Note: The `instructions` parameter controls voice style via natural language. Do NOT substitute it with `speech_rate`, `pitch_rate`, or `volume_rate` numeric parameters.
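The call returns an audio URL rather than raw bytes. A minimal sketch of a download helper that auto-detects WAV vs MP3 from the file's magic bytes — `download_audio` and the detection logic here are illustrative, not the actual implementation in `scripts/`:

```python
import urllib.request

def detect_audio_format(data: bytes) -> str:
    """Guess WAV vs MP3 from magic bytes (qwen-tts returns one of these)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3: either an ID3 tag or a raw MPEG frame sync (0xFFEx)
    if data[:3] == b"ID3" or (len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "mp3"
    return "bin"  # unknown; keep the raw bytes

def download_audio(url: str, output_path: str) -> str:
    """Download synthesized audio and append the detected extension."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    path = f"{output_path}.{detect_audio_format(data)}"
    with open(path, "wb") as f:
        f.write(data)
    return path
```

This mirrors the `download_audio(url, output_path)` step described in the workflow section below.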
```python
import sys

try:
    response = dashscope.MultiModalConversation.call(
        model="qwen-tts", text=text, voice=voice
    )
    if response.status_code != 200:
        print(f"qwen-tts call failed: {response.code} - {response.message}")
        sys.exit(1)
except Exception as e:
    print(f"qwen-tts call failed: {e}")
    print("Please check: 1) Is DASHSCOPE_API_KEY set? 2) Is the network available?")
    sys.exit(1)
# Do NOT fall back to edge-tts, gTTS, or other services here
```
| Feature | Model | Highlights |
|---|---|---|
| Long Audio Recognition | qwen3-asr-flash-filetrans | Up to 12 hours, supports emotion detection & timestamps |
| Short Audio Recognition | qwen3-asr-flash | Up to 5 minutes, low latency |
| Speech Synthesis | qwen-tts | Multiple voices, multilingual, instruction control |
| Instruct-Controlled Synthesis | qwen-tts + instructions | Control voice expressiveness via natural language |
| Product | API / SDK Call | Purpose |
|---|---|---|
| DashScope ASR | Transcription.async_call + Transcription.wait | Long audio recognition (async) |
| DashScope ASR | POST /services/audio/asr/transcription | Short audio recognition (sync) |
| DashScope TTS | MultiModalConversation.call | Speech synthesis (standard / instruct-controlled) |
| Alibaba Cloud CLI ModelStudio | create-api-key / list-workspaces / delete-api-key | API Key lifecycle management |
```
User Request
|
+-- Intent: Audio -> Text (ASR)
|   |
|   +-- Audio duration <= 5 min AND file <= 10MB AND no emotion/timestamps needed?
|   |   -> Short audio recognition: qwen3-asr-flash (sync, low latency)
|   |
|   +-- Other cases (long audio / emotion detection / timestamps needed)
|       -> Long audio recognition: qwen3-asr-flash-filetrans (async, submit + poll)
|
+-- Intent: Text -> Speech (TTS)
|   |
|   +-- User specified voice style/emotion/speed requirements?
|   |   -> Instruct-controlled synthesis: qwen-tts + instructions parameter
|   |
|   +-- Standard reading only
|       -> Standard synthesis: qwen-tts
|
+-- Prerequisite: No available API Key
    -> Call api_key.py: get_api_key() auto-reads
    -> If none exists: generate_api_key() creates via Alibaba Cloud CLI and saves
```
Speech Recognition (Long Audio):
1. `get_api_key()` -> Get DashScope API Key
2. `Transcription.async_call(model, file_urls, language_hints)` -> Submit async task, get task_id
3. `Transcription.wait(task=task_id)` -> Poll until the task completes
4. Download the `output.results[].transcription_url` result file
5. Parse `transcripts[].text` / `sentences[]` / `emotion` from the JSON

Speech Recognition (Short Audio):
1. `get_api_key()` -> Get DashScope API Key
2. `POST /services/audio/asr/transcription` -> Sync call, returns recognized text directly

Speech Synthesis (Standard / Instruct-Controlled):
1. `get_api_key()` -> Get DashScope API Key
2. `MultiModalConversation.call(model, text, voice, [instructions])` -> Returns audio URL
3. `download_audio(url, output_path)` -> Download audio and auto-detect format (WAV/MP3)

API Key Auto-Retrieval:
1. Read `~/.aliyun/config.json` current profile's `dashscope.api_key` -> Return if found
2. Read environment variable `DASHSCOPE_API_KEY` -> Return if found
3. Otherwise call `generate_api_key()` and save the new key to the config

| Condition | Choice |
|---|---|
| Audio <= 5 min and <= 10MB | qwen3-asr-flash |
| Audio > 5 min or > 10MB | qwen3-asr-flash-filetrans |
| Need emotion detection / timestamps / punctuation | qwen3-asr-flash-filetrans |
| TTS with no style requirements | qwen-tts standard call |
| TTS with style/emotion/speed requirements | qwen-tts + instructions |
| Need dialect voices | Not supported by the current qwen-tts; requires a future model update or a different TTS model |
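The ASR rows of this table reduce to a small decision function. A sketch (the helper name `pick_asr_model` is illustrative; the thresholds come from the table above):

```python
def pick_asr_model(duration_s: float, size_mb: float,
                   need_rich_output: bool = False) -> str:
    """Choose an ASR model per the selection table.

    need_rich_output covers emotion detection, timestamps, and punctuation,
    which only the filetrans model supports.
    """
    if need_rich_output or duration_s > 5 * 60 or size_mb > 10:
        return "qwen3-asr-flash-filetrans"
    return "qwen3-asr-flash"
```

Example: a 2-minute, 2 MB voice message maps to `qwen3-asr-flash`; a 1-hour, 50 MB meeting recording maps to `qwen3-asr-flash-filetrans`.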
| Scenario | Recommended Model | Notes |
|---|---|---|
| Meeting transcription, interview records | qwen3-asr-flash-filetrans | Long audio, supports emotion detection & timestamps |
| Voice messages, real-time subtitles | qwen3-asr-flash | Short audio, low latency |
| Customer service QA | qwen3-asr-flash-filetrans | Can analyze customer emotions |
| Singing audio analysis | qwen3-asr-flash-filetrans | Supports lyrics recognition & emotion analysis |
Supported languages: Chinese (Mandarin, Sichuan dialect, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, and 30+ other languages.
Supported audio/video formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
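Before submitting a file, its extension can be checked against this list. A small sketch (`is_supported` is a hypothetical helper, not part of `scripts/`):

```python
# Extensions accepted by the ASR services, per the format list above
SUPPORTED_EXTS = {
    "aac", "amr", "avi", "flac", "flv", "m4a", "mkv", "mov", "mp3", "mp4",
    "mpeg", "ogg", "opus", "wav", "webm", "wma", "wmv",
}

def is_supported(path: str) -> bool:
    """Case-insensitive extension check against the supported-format list."""
    return path.rsplit(".", 1)[-1].lower() in SUPPORTED_EXTS
```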
| Feature | qwen3-asr-flash-filetrans | qwen3-asr-flash |
|---|---|---|
| Audio Duration | Up to 12 hours (<=2GB) | Up to 5 minutes (<=10MB) |
| Emotion Detection | Supported (Surprise/Calm/Happy/Sad/Disgust/Angry/Fear) | Not supported |
| Timestamps | Supported (sentence/word level) | Not supported |
| Punctuation Prediction | Supported | Not supported |
| Singing Recognition | Supported | Not supported |
| Noise Rejection | Supported | Not supported |
| Scenario | Recommended Model | Notes |
|---|---|---|
| Audiobooks, radio drama dubbing | qwen-tts + instructions | Supports instruction control, rich expressiveness |
| Navigation, notification announcements | qwen-tts | Short text, high frequency calls |
| Online education courseware | qwen-tts | Multilingual support |
Important Notes:
- Speech synthesis must use the `MultiModalConversation.call` API.
- When users request a specific voice style (e.g., livestream sales style, gentle style, news broadcast), the `instructions` parameter must be used to control voice expressiveness via natural language.

Difference between `instructions` and traditional numeric parameters:
- `instructions`: natural language description, e.g., "语速快,充满热情" (fast pace, full of enthusiasm) -> must use this approach
- `speech_rate` / `pitch_rate` / `volume_rate`: numeric parameters -> forbidden; qwen-tts does not support these parameters

Call method (follow strictly):
```python
response = dashscope.MultiModalConversation.call(
    model="qwen-tts",
    text="Text to synthesize",
    voice="Cherry",
    # NOTE: instructions value must be in Chinese - the qwen-tts model processes Chinese instructions
    instructions="语速快,充满热情和感染力,直播带货风格,音调偏高"
)
```
Description dimensions reference:
| Dimension | Examples |
|---|---|
| Pitch | High, medium, low, slightly high, slightly low |
| Speed | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, healing |
| Characteristics | Magnetic, crisp, husky, mellow, sweet, deep, powerful |
| Use Case | News broadcast, ad voiceover, audiobook, animation character, voice assistant |
Instruction examples (in Chinese, as required by the model; English glosses in parentheses):
- 语速较快,带有明显的上扬语调,适合介绍时尚产品 (fairly fast pace, distinctly rising intonation, suited to introducing fashion products)
- 音量由正常对话迅速增强至高喊,性格直率,情绪易激动 (volume quickly rises from normal conversation to shouting; blunt, easily excited persona)
- 哭腔导致发音略微含糊,略显沙哑,带有明显哭腔的紧张感 (a sobbing tone slightly blurs pronunciation, a little hoarse, with clearly tearful tension)
- 音调偏高,语速中等,充满活力和感染力,适合广告配音 (slightly high pitch, medium pace, energetic and engaging, suited to ad voiceover)
When calling qwen-tts via MultiModalConversation.call, the following 4 voices are supported:
| voice Parameter | Voice Name | Description |
|---|---|---|
| Cherry | Qianyue | Sunny, positive, naturally approachable young woman (Female) |
| Serena | Suyao | Gentle young woman (Female) |
| Ethan | Chenxu | Sunny, warm, energetic (Male) |
| Chelsie | Qianxue | Anime-style virtual companion (Female) |
Note: Other voices (Jennifer, Ryan, Neil, Elias, and dialect voices) require the `qwen3-tts-flash` model's `SpeechSynthesizer` WebSocket API, which is not currently supported by these scripts.
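The four-voice constraint above can be enforced before calling the API. A sketch (`validate_voice` is a hypothetical helper; the voice names come from the table):

```python
# Voices supported by qwen-tts via MultiModalConversation.call (see table above)
QWEN_TTS_VOICES = {"Cherry", "Serena", "Ethan", "Chelsie"}

def validate_voice(voice: str) -> str:
    """Fail fast on voices that require the unsupported WebSocket API."""
    if voice not in QWEN_TTS_VOICES:
        raise ValueError(
            f"Voice '{voice}' is not supported by qwen-tts; "
            f"choose one of {sorted(QWEN_TTS_VOICES)}"
        )
    return voice
```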
FFmpeg is used for audio format conversion, sample rate adjustment, and other preprocessing tasks.
```shell
# macOS (Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg
```

Verify installation:

```shell
ffmpeg -version
```
```shell
pip install -r scripts/requirements.txt
```
API Keys are managed by the unified scripts/api_key.py module, with the following retrieval priority:
1. `~/.aliyun/config.json` current profile's `dashscope.api_key`
2. Environment variable `DASHSCOPE_API_KEY`
3. Auto-create via Alibaba Cloud CLI (`generate_api_key()`)

```python
# All scripts use this unified approach
from api_key import get_api_key

api_key = get_api_key()  # Returns str or None (SDK resolves auth when None)
```
Manual environment variable configuration:

```shell
export DASHSCOPE_API_KEY=sk-xxx
```
| Item | Description |
|---|---|
| Key Format | sk-xxx (standard DashScope API Key) |
| Not Supported | sk-sp-xxx (Coding Plan Key, does not support voice services) |
| Get Key | https://bailian.console.aliyun.com/cn-beijing/?tab=app#/api-key |
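The key-format rule in the table can be checked up front. A sketch (`check_key_format` is a hypothetical helper; the prefixes come from the table above):

```python
def check_key_format(api_key: str) -> bool:
    """True only for standard DashScope keys (sk-xxx).

    Coding Plan keys (sk-sp-xxx) do not support voice services.
    """
    if api_key.startswith("sk-sp-"):
        return False
    return api_key.startswith("sk-")
```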
The scripts/api_key.py module creates and deletes API Keys via aliyun modelstudio commands. Complete the following setup before use:
1. Enable AI-Mode and Update Plugins

```shell
# Enable AI-Mode (allow Agent to call CLI)
aliyun configure ai-mode enable

# Set User-Agent
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-bailian-voice-creator"

# Update plugins to latest version
aliyun plugin update
```

2. Install ModelStudio Plugin (if not already installed)

```shell
aliyun plugin install --names aliyun-cli-modelstudio --enable-pre
```

3. Disable AI-Mode After Task Completion

```shell
aliyun configure ai-mode disable
```
CLI Commands Used:
| Command | Purpose | Called From |
|---|---|---|
| `aliyun modelstudio list-workspaces` | Get Bailian Workspace ID | `api_key.py: _get_workspace_id()` |
| `aliyun modelstudio create-api-key` | Create DashScope API Key | `api_key.py: generate_api_key()` |
| `aliyun modelstudio delete-api-key` | Delete cloud API Key | `api_key.py: _delete_cloud_api_key()` |
```shell
# Query audio info
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 audio.mp3

# Convert to 16kHz mono WAV (recommended for ASR)
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav

# Trim audio (start at 1:30, extract 2 minutes)
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav

# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec mp3 audio.mp3
```
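The 16 kHz mono conversion above can be driven from Python. A sketch (helper names are illustrative; the ffmpeg flags match the command shown):

```python
import subprocess

def build_asr_convert_cmd(src: str, dst: str) -> list[str]:
    """argv for converting audio to 16 kHz mono 16-bit WAV, as recommended for ASR."""
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1",            # mono
            "-ar", "16000",        # 16 kHz sample rate
            "-sample_fmt", "s16",  # 16-bit PCM
            dst]

def convert_for_asr(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_asr_convert_cmd(src, dst), check=True)
```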
```
voice-creator/
├── scripts/
│   ├── api_key.py               # API Key management module
│   ├── speech_recognition.py    # Speech recognition example
│   ├── speech_synthesis.py      # Speech synthesis example
│   ├── generate_livestream.py   # Livestream sales voice generation example
│   └── requirements.txt         # Python dependencies (pinned versions)
├── references/
│   ├── api-docs.md              # API reference documentation
│   ├── models.md                # Model list and selection guide
│   └── error-codes.md           # Error code reference
├── evals/                       # Test cases
│   ├── config/
│   ├── scenarios/
│   └── triggering/
├── related_apis.yaml
└── SKILL.md
```
| Script | Function | Model |
|---|---|---|
| `api_key.py` | API Key management (get, create, delete) | - |
| `speech_recognition.py` | Speech recognition (long/short audio) | qwen3-asr-flash-filetrans / qwen3-asr-flash |
| `speech_synthesis.py` | Speech synthesis (with instruction control) | qwen-tts |
| `generate_livestream.py` | Livestream sales style voice generation | qwen-tts |
Changelog (2026-03-18):
- TTS service switched to the `MultiModalConversation.call` API
- API Key retrieval reads `~/.aliyun/config.json` first, with environment variable fallback

Run the example scripts:

```shell
python scripts/speech_recognition.py
python scripts/speech_synthesis.py
```
```python
from speech_synthesis import synthesize_speech, synthesize_with_instruct

# Standard synthesis
audio_path = synthesize_speech(
    text="Hello, this is a test voice",
    voice="Cherry",
    output_file="output.wav"
)

# Instruct-controlled synthesis (livestream sales style)
audio_path = synthesize_with_instruct(
    text="Hello everyone, this product is amazing!",
    voice="Cherry",
    # NOTE: instructions must be in Chinese for the qwen-tts model
    instructions="语速快,充满热情和感染力,直播带货风格",
    output_file="livestream.wav"
)
```
| Region | URL |
|---|---|
| Beijing | https://dashscope.aliyuncs.com/api/v1 |
| Singapore | https://dashscope-intl.aliyuncs.com/api/v1 |
Note: API Keys are not interchangeable between regions.
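Since keys are region-bound, the base URL should be chosen explicitly. A sketch (`base_url` is a hypothetical helper; the URLs come from the table above):

```python
# Region endpoints from the table above
DASHSCOPE_BASE_URLS = {
    "beijing": "https://dashscope.aliyuncs.com/api/v1",
    "singapore": "https://dashscope-intl.aliyuncs.com/api/v1",
}

def base_url(region: str) -> str:
    """Resolve the DashScope endpoint; keys from one region fail on the other."""
    try:
        return DASHSCOPE_BASE_URLS[region.lower()]
    except KeyError:
        raise ValueError(f"Unknown region '{region}'; API keys are region-bound") from None
```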
Billed by input audio duration (seconds); output is not billed.
| Model | Unit Price |
|---|---|
| qwen3-asr-flash-filetrans | ¥0.00022/second |
| qwen3-asr-flash | ¥0.00022/second |
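At ¥0.00022 per second, ASR cost scales linearly with input duration. A quick estimator (a sketch; the rate comes from the table above):

```python
ASR_PRICE_PER_SECOND = 0.00022  # CNY; same rate for both ASR models

def asr_cost_cny(duration_seconds: float) -> float:
    """Estimated ASR cost in CNY; only input audio duration is billed."""
    return round(duration_seconds * ASR_PRICE_PER_SECOND, 4)
```

For example, a one-hour meeting recording (3600 seconds) costs about ¥0.79.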
Pricing Examples:
Billed by input and output tokens.
| Billing Item | Unit Price |
|---|---|
| Input Text | ¥0.0016/1K tokens |
| Output (Audio) | ¥0.01/1K tokens |
Pricing Examples:
Notes:
After activating Bailian, new users receive:
Trigger this skill when users request tasks such as: