Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Qwen3-TTS VoiceDesign

v1.0.0

Text-to-speech with Qwen3-TTS VoiceDesign. Design custom voices via natural language descriptions + seed-based timbre fixation. Includes OpenAI-compatible AP...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for xiaoyaner0201/qwen3-tts-voicedesign.

Install the skill "Qwen3-TTS VoiceDesign" (xiaoyaner0201/qwen3-tts-voicedesign) from ClawHub.
Skill page: https://clawhub.ai/xiaoyaner0201/qwen3-tts-voicedesign
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Canonical install target

openclaw skills install xiaoyaner0201/qwen3-tts-voicedesign

ClawHub CLI

Package manager switcher

npx clawhub@latest install qwen3-tts-voicedesign
Security Scan

VirusTotal: Suspicious (view report →)
OpenClaw: Benign (medium confidence)
Purpose & Capability
Name/description (Qwen3-TTS VoiceDesign TTS server + client tools) matches the included scripts: a FastAPI server, client helpers, setup script and seed-batching tooling. The declared behavior (model download, one-click setup, OpenAI-compatible API) is consistent with the code.
Instruction Scope
SKILL.md instructs running setup.sh which creates a venv, pip-installs dependencies, downloads the model (ModelScope or Hugging Face), and runs the server; the runtime scripts only reference their .env and local files. Notable scope items: the server code clears proxy environment variables at start (potentially bypassing a corporate proxy), and the docs show guidance to register scheduled tasks or systemd units (these are only instructions, not executed automatically). The client scripts build JSON bodies via shell interpolation (potential for malformed input/escaping issues if used with untrusted text).
Install Mechanism
There is no platform install spec, but setup.sh will pip-install packages (qwen-tts, soundfile, pydub, uvicorn, fastapi, numpy and possibly modelscope and torch from the official PyTorch index). It downloads the ~3.5GB model via ModelScope or Hugging Face. These are expected for a local TTS runtime but do involve network access and large binary downloads; the sources used (ModelScope/HuggingFace, PyTorch wheel index) are standard release hosts rather than arbitrary shorteners.
Credentials
The skill requests no credentials and exposes only environment variables relevant to running a local TTS server (seed, instruct, model path, host/port, format). The only surprising behavior is that the server explicitly clears HTTP(S) proxy environment variables at startup, which may affect network routing on hosts that rely on proxies; this is operational (not credential) behavior and not an attempt to read secrets.
Persistence & Privilege
The skill is not always-enabled and does not attempt to change other skills' config. setup.sh suggests how to create systemd units or a Windows scheduled task, but it does not automatically create system-level services or elevate privileges. You must run setup/start manually, so persistence is user-controlled.
Assessment
This package appears to do what it says: set up a local TTS server, download a voice model, and provide client scripts. Before installing:

1. Expect large downloads (~3.5GB) and many pip-installed packages (including torch/CUDA). Run in a controlled environment, VM, or container if you don't want changes to your main system.
2. The server clears HTTP(S) proxy environment variables at startup. On a corporate network that requires a proxy for outbound connections, that may change routing; run behind a firewall or bind the server to 127.0.0.1 (TTS_HOST) if you only need local access.
3. Setup downloads model data from ModelScope/HuggingFace and installs packages from PyPI. Verify you trust those sources and the specified model repo.
4. The client shell scripts construct JSON via simple interpolation. Avoid passing untrusted or unsanitized text that could break the shell invocation.
5. If you plan to expose the server beyond localhost, secure it (firewall, reverse proxy, auth), because it exposes an HTTP API.

If you want more assurance, run setup in an isolated container, inspect the pip-installed packages and the model repo, and avoid enabling the systemd/scheduled-task instructions unless you understand the implications.

Like a lobster shell, security has layers — review code before you run it.

latest: vk979achs9ev31hkgpc9dpz6ba181swqd
628 downloads · 0 stars · 1 version
Updated 9h ago
v1.0.0
MIT-0

Qwen3-TTS VoiceDesign

Text → Speech with natural language voice descriptions + seed-based timbre fixation.

Quick Start

# Generate speech (uses server defaults)
TTS_URL=http://your-server:8881 scripts/say.sh "Hello world!"

# Save to file
scripts/say.sh "Save this" output.mp3

# Batch compare seeds (voice exploration)
scripts/batch_seeds.sh "Hello world!" 42 123 201 456 789 /tmp/seeds

Environment Variables

All config via env vars — text is the only required argument:

Variable         Default                                 Description
TTS_URL          http://localhost:8881                   Server base URL (client side)
TTS_SEED         4096                                    Random seed → controls timbre
TTS_INSTRUCT     (generic female voice)                  Voice description prompt
TTS_MODEL_PATH   Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign    Model weights path
TTS_PORT         8881                                    Server listen port
TTS_HOST         0.0.0.0                                 Server bind address
TTS_FORMAT       mp3                                     Output format: mp3 / wav

Server reads from .env file in its directory. Client scripts read from shell env.
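The precedence implied by the table (shell environment first, documented default otherwise) can be sketched as a small lookup helper. The variable names and defaults are taken from the table above; the helper itself is illustrative, not the skill's actual code. TTS_INSTRUCT is omitted because its default is described rather than specified literally.

```python
import os

# Defaults mirror the environment-variable table in the docs.
DEFAULTS = {
    "TTS_URL": "http://localhost:8881",
    "TTS_SEED": "4096",
    "TTS_MODEL_PATH": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "TTS_PORT": "8881",
    "TTS_HOST": "0.0.0.0",
    "TTS_FORMAT": "mp3",
}

def get_conf(name: str) -> str:
    # The shell environment wins; otherwise fall back to the documented default.
    return os.environ.get(name, DEFAULTS[name])

os.environ["TTS_SEED"] = "201"          # e.g. exported in the shell
assert get_conf("TTS_SEED") == "201"    # override takes effect
assert get_conf("TTS_FORMAT") == "mp3"  # default used when unset
```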

Voice Description Example

30岁男性播音员,声音低沉磁性,
语速稳重从容,咬字清晰标准,
像新闻联播主播的专业感,又带一点温暖。

(Translation: a 30-year-old male broadcaster with a deep, magnetic voice; steady, unhurried pacing; clear, standard enunciation; the professional feel of a news anchor, with a touch of warmth.)

Tip: Once you've found your perfect voice (description + seed), set them as server defaults in .env. Then client calls only need to pass text.

API

OpenAI-Compatible

curl -X POST $TTS_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello!"}' -o speech.mp3

Custom (seed + instruct override)

curl -X POST $TTS_URL/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "seed": 201, "instruct": "温柔女生"}' -o speech.mp3

GET (quick test)

curl "$TTS_URL/tts?text=Hello&seed=201" -o test.mp3

Seed Mechanics

Same (description + seed) → same timbre. Different seeds → completely different voices.

⚠️ Seeds are purely random — seed 42 and 43 can sound completely different. Finding a voice = opening blind boxes.

Workflow: fix description → batch 30-40 seeds → listen → shortlist 2-3 → compare across scenarios → pick.
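The batch step of the workflow above amounts to mapping a list of seeds to one audio file each, named so the results sort and compare easily. A sketch of that mapping (the zero-padded naming scheme is an assumption; batch_seeds.sh may use a different one):

```python
from pathlib import Path

def seed_outputs(seeds: list[int], out_dir: str, fmt: str = "mp3") -> list[Path]:
    # One output file per candidate seed, zero-padded so that
    # lexicographic and numeric order agree when listing the directory.
    out = Path(out_dir)
    return [out / f"seed_{s:05d}.{fmt}" for s in seeds]

files = seed_outputs([42, 123, 201], "/tmp/seeds")
assert [f.name for f in files] == ["seed_00042.mp3", "seed_00123.mp3", "seed_00201.mp3"]
```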

Deploy Your Own

# One-click setup (Python 3.10+ and CUDA GPU required)
bash scripts/setup.sh ./my-tts

# Configure voice in .env
echo 'TTS_SEED=201' >> ./my-tts/.env
echo 'TTS_INSTRUCT=Your voice description here' >> ./my-tts/.env

# Start server
bash scripts/setup.sh start ./my-tts

Setup installs: qwen-tts, soundfile, pydub, uvicorn, fastapi, torch (CUDA). Downloads VoiceDesign model (~3.5GB) via ModelScope (China) or HuggingFace.

Requirements: CUDA GPU with 4GB+ VRAM, Python 3.10+, ~4GB disk.

Scripts

Script                  Purpose
scripts/say.sh          Generate speech — say.sh "text" [output.mp3]
scripts/batch_seeds.sh  Compare seeds — batch_seeds.sh "text" seed1 seed2 ...
scripts/tts_server.py   FastAPI server (fully env-configurable)
scripts/setup.sh        One-click deploy (venv + deps + model download)

OpenClaw Integration

In openclaw.json:

{
  "env": { "OPENAI_TTS_BASE_URL": "http://<your-server>:8881/v1" },
  "messages": {
    "tts": {
      "provider": "openai",
      "openai": { "apiKey": "dummy", "model": "qwen3-tts", "voice": "default" },
      "timeoutMs": 120000
    }
  }
}

Server Management

# Health check
curl -s $TTS_URL/health

# Start (foreground)
python tts_server.py

# Start (background, Linux/macOS)
nohup python tts_server.py > server.log 2>&1 &

# Auto-restart (Windows — scheduled task + guard script)
# Create tts_guard.bat:
#   @echo off
#   :loop
#   python tts_server.py
#   timeout /t 10
#   goto loop
# Register: schtasks /create /tn "TTS-Guard" /tr "tts_guard.bat" /sc onlogon /rl highest

# Auto-restart (Linux — systemd)
# See setup.sh output for systemd unit template

# Stop
# Linux/macOS: kill $(lsof -ti:8881)
# Windows: for /f "tokens=5" %a in ('netstat -aon ^| findstr :8881') do taskkill /PID %a /F

Troubleshooting

  • Connection refused → Server not running; start it
  • 30s+ first request → Cold start (model loading ~60s); subsequent requests 10-15s
  • Behind proxy → Set NO_PROXY=<server_ip> on client side
  • Windows firewall → netsh advfirewall firewall add rule name="TTS" dir=in action=allow protocol=TCP localport=8881
  • No flash-attn on Windows → Expected; falls back to PyTorch SDPA (slower but works)
  • PowerShell corrupts Chinese → Edit .env/config via Python or SCP, not PowerShell Set-Content
  • Process dies on SSH disconnect → Use scheduled task (Windows) or systemd (Linux) instead of foreground

Voice Design Tips

Describe like casting a voice actor:

  • Age/gender: "18岁女大学生" (18-year-old female college student) / "30岁男性播音员" (30-year-old male broadcaster)
  • Texture: "柔和温暖" (soft and warm) / "清脆明亮" (crisp and bright) / "低沉磁性" (deep and magnetic)
  • Emotion: "轻柔细腻" (gentle and delicate) / "活泼开朗" (lively and cheerful)
  • Accent: "南方口音软糯" (soft southern accent) / "台湾腔" (Taiwanese accent) / "东北大碴子味" (thick northeastern drawl)
  • Metaphor: "像棉花糖" (like cotton candy) / "像播音主持" (like a broadcast host) — helps the model capture feeling

⚠️ Timbre ≠ description. The description controls style and emotion; the seed controls timbre. Don't put personality traits like "灵动俏皮" (lively and playful) in the description — that's the seed's job.
