Qwen Audio

v0.0.6

High-performance audio library with text-to-speech (TTS) and speech-to-text (STT).

by noah@darknoah
Security Scan

  • VirusTotal: Benign
  • OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name/description (TTS/STT) matches the included code and pyproject dependencies (qwen-asr, qwen-tts, mlx-audio, torch). However, the SKILL.md and registry metadata claim no required binaries or env vars, while the instructions and code rely on the 'uv' CLI and Python >=3.10, and may require network access to download large models. The overall capability is coherent with its stated purpose, but some required runtime pieces are not declared in the metadata.
Instruction Scope
Runtime instructions tell the agent to run 'uv run ...' and to manipulate a local ./voices/ directory; the code will read and write these local voice files. Instructions require the user to run env-checks and to explicitly confirm voice selection before TTS, which limits accidental use. The SKILL.md does not explicitly warn that model downloads and package installs will occur, but the code will contact Hugging Face and other endpoints and can operate in online/offline modes.
Install Mechanism
There is no platform install spec (instruction-only), but the pyproject.toml lists heavy ML dependencies and a custom torch index. The script itself will run a shell command (os.system("uv add mlx-audio ...")) to install missing packages at runtime. Auto-install and model downloads introduce moderate risk (large network/disk operations and execution of runtime-installed packages).
Credentials
The skill declares no required environment variables, but the code reads/uses QWEN_AUDIO_DEVICE, QWEN_AUDIO_DTYPE, HF_ENDPOINT and may set HF_HUB_OFFLINE. No secret or credential env vars are requested. The mismatch between declared requirements and actual env usage reduces transparency and should be resolved before trusting the skill.
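As a hedged illustration of how these undeclared variables could influence behavior, the sketch below resolves them into a runtime config. The variable names come from the review above; the default values are assumptions for illustration, not the script's actual defaults.

```python
import os

def resolve_runtime_config(env=os.environ):
    """Resolve the undeclared env vars the code is reported to read.

    Default values here are illustrative assumptions only.
    """
    return {
        "device": env.get("QWEN_AUDIO_DEVICE", "auto"),   # e.g. "cpu", "cuda", "mps"
        "dtype": env.get("QWEN_AUDIO_DTYPE", "auto"),     # e.g. "float16", "bfloat16"
        "hf_endpoint": env.get("HF_ENDPOINT", "https://huggingface.co"),
        "offline": env.get("HF_HUB_OFFLINE", "0") == "1", # may be set by the script itself
    }
```

Pinning these variables explicitly (or clearing them) before running the skill makes its behavior reproducible and auditable.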
Persistence & Privilege
The 'always' flag is false, and the skill does not request system-wide config changes or other skills' credentials. It writes voice profiles under its own ./voices/ directory and may create or update files like references/env-check-list.md as instructed, which is normal for a local audio skill.
What to consider before installing
This skill implements TTS/STT and largely does what it says, but take these precautions before installing or letting an agent run it:

  • Run it in an isolated environment (VM/container): it downloads and installs heavy ML packages and models (torch, qwen-tts/asr, etc.), which use significant disk, memory, and network.
  • Ensure the 'uv' CLI and Python 3.10+ are available. The SKILL.md uses 'uv run', but the registry metadata does not list 'uv' as a required binary.
  • Expect network access to Hugging Face and other endpoints (the code probes HF_ENDPOINT and can download models). If you need to avoid external network traffic, do not install or run the skill.
  • The script may auto-install missing Python packages via os.system('uv add ...'). This is a legitimate convenience, but it increases runtime privilege and attack surface; review the pyproject.toml and the packages it will pull before proceeding.
  • Voices and other files are stored under ./voices/, and the skill writes to its own folder; consider filesystem permissions and where you run it.
  • No credentials are requested, but environment variables (QWEN_AUDIO_DEVICE, QWEN_AUDIO_DTYPE, HF_ENDPOINT) influence behavior; they are not declared in the metadata and should be documented or locked down.

If you need lower risk, ask the author to (1) declare required binaries and env vars explicitly, (2) remove runtime auto-installs or make them opt-in, and (3) document model download endpoints and disk requirements. Review the full scripts/qwen-audio.py before granting the skill autonomous invocation.
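The two undeclared prerequisites called out above (the 'uv' CLI and Python 3.10+) can be checked before anything is installed. This is a stand-alone pre-flight sketch, not part of the skill itself:

```python
import shutil
import sys

def preflight_issues():
    """Return a list of missing prerequisites for running the skill."""
    issues = []
    if sys.version_info < (3, 10):
        issues.append(
            f"Python 3.10+ required, found {sys.version_info.major}.{sys.version_info.minor}"
        )
    if shutil.which("uv") is None:
        issues.append("'uv' CLI not found on PATH (every command in SKILL.md uses it)")
    return issues
```

Running this in the target environment surfaces the gaps the registry metadata currently leaves undeclared.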


latest: vk977017ffhh34jet63cc4zgwcx82fg5m
399 downloads · 1 star · 3 versions
Updated 1mo ago
v0.0.6 · MIT-0

Qwen-Audio

Overview

Qwen-Audio is a high-performance, optimized audio processing library. It delivers fast, efficient TTS and STT with support for multiple models, languages, and audio formats.

Prerequisites

  • Python 3.10+

Environment checks

Before using any capability, verify that all items in ./references/env-check-list.md are complete.

Capabilities

Voice Management

Voices are stored in the ./voices/ directory at the skill root level. Each voice has its own folder containing:

  • ref_audio.wav - Reference audio file
  • ref_text.txt - Reference text transcript
  • ref_instruct.txt - Voice style description
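Given the folder layout above, a voice profile can be read back with a few lines of Python. This is a sketch based only on the documented file names; the skill's own script may load profiles differently:

```python
from pathlib import Path

def load_voice_profile(voices_dir, voice_id):
    """Read one voice profile from the ./voices/<id>/ layout described above."""
    folder = Path(voices_dir) / voice_id
    return {
        "id": voice_id,
        "ref_audio": str(folder / "ref_audio.wav"),  # path only; audio is not decoded here
        "ref_text": (folder / "ref_text.txt").read_text(encoding="utf-8").strip(),
        "instruct": (folder / "ref_instruct.txt").read_text(encoding="utf-8").strip(),
    }
```

Because each voice is just three plain files in a folder, profiles can be inspected, versioned, or copied between machines by hand.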

Create a Voice

Create a reusable voice profile using VoiceDesign model. The --instruct parameter is required to describe the voice style:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --id "my-voice-id"

Optional: --id "my-voice-id" to specify a custom voice ID.

Returns (JSON):

{
  "id": "my-voice-id",
  "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000,
  "success": true
}
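Since the command prints JSON, a caller can capture stdout and validate it before proceeding. A hedged sketch of that check, using only the field names shown in the example output above:

```python
import json

def parse_voice_create_output(stdout):
    """Parse the JSON printed by `voice create` and fail loudly on errors."""
    result = json.loads(stdout)
    if not result.get("success"):
        raise RuntimeError(f"voice create failed: {result}")
    # Field names mirror the documented example output.
    return result["id"], result["ref_audio"]
```

Feeding the captured stdout of the `uv run ... voice create` command into this function yields the voice id to pass as `--ref_voice` later.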

List Voices

List all created voice profiles:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice list

Returns (JSON):

[
  {
    "id": "my-voice-id",
    "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]

Text to Speech

TTS Voice Pre-check (Required)

Before any TTS generation, always confirm the available voices first:

  1. Run voice list to check the current voice profiles.
  2. If the returned list is empty, stop and ask the user what kind of voice they want to create first. Offer style choices, for example:
    • Warm and friendly female narrator
    • Deep and steady male broadcast voice
    • Young and energetic neutral voice
    • Calm and professional customer-service voice
    Then run voice create only after the user confirms a style.
  3. If the returned list is not empty, show the available voice id values and ask the user to confirm which one should be used as the --ref_voice reference id for generation.

Only run tts after this confirmation step is complete.
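The pre-check above reduces to a small decision function. This sketch assumes the `voice list` JSON output has already been parsed into a Python list of dicts with `id` keys, as in the documented example:

```python
def precheck_action(voices):
    """Decide the next step from parsed `voice list` output,
    following the required TTS pre-check above."""
    if not voices:
        # Empty list: stop and ask the user which voice style to create first.
        return ("ask_create", None)
    # Otherwise, surface the available ids and ask the user to confirm one
    # as the --ref_voice reference id.
    return ("confirm_ref_voice", [v["id"] for v in voices])
```

Either branch ends with a question to the user, never with an automatic model invocation, which is what keeps accidental TTS runs in check.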

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav"

Returns (JSON):

{
  "audio_path": "/path/to/save.wav",
  "duration": 1.234,
  "sample_rate": 24000,
  "success": true
}

Voice Cloning

Clone any voice using a reference audio sample. Provide the wav file and its transcript:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."

  • ref_audio: reference audio to clone
  • ref_text: transcript of the reference audio

Use a Created Voice

After creating a voice, use it for TTS with the --ref_voice parameter. The instruct will be automatically loaded:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "New text to speak" --output "/path/to/save.wav" --ref_voice "my-voice-id" --instruct "Very happy and excited."

Optional: --instruct for emotion control.

Automatic Speech Recognition (STT)

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" stt --audio "/sample_audio.wav" --output "/path/to/save.txt" --output-format txt

Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav

output-format: "txt" | "ass" | "srt" | "all"

Returns (JSON):

{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path/to/save.txt", "/path/to/save.srt"],
  "success": true
}
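The `files` array in the output depends on `--output-format`. The sketch below maps a format choice to the paths the command should write, assuming (as the example `files` list suggests, but the docs do not state outright) that "all" emits every format alongside the requested output path:

```python
def expected_outputs(base_path, output_format):
    """Map --output-format to the file paths stt should produce.

    The 'all' expansion is an assumption based on the documented
    choices: "txt" | "ass" | "srt" | "all".
    """
    formats = ["txt", "ass", "srt"] if output_format == "all" else [output_format]
    stem = base_path.rsplit(".", 1)[0]  # strip the extension from --output
    return [f"{stem}.{ext}" for ext in formats]
```

An agent can compare this list against the returned `files` field to verify the transcription produced everything it expected.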
