Install
openclaw skills install @praanmichael/stepfun-step-audio-r1-1Use StepFun Chat Completions with model step-audio-r1.1 for non-streaming speech turns that can send text with optional local audio input and save the returned audio, transcript, and raw response object.
openclaw skills install @praanmichael/stepfun-step-audio-r1-1Call StepFun's POST /v1/chat/completions endpoint with stream: false and
model: step-audio-r1.1.
Use this skill when the user explicitly wants StepFun audio generation,
speech-style replies through the Chat API, a standard non-streaming chat
completion object, or local audio input encoded as input_audio.
Do not use this skill for realtime duplex voice sessions. Use StepFun Realtime API instead when the user wants low-latency live conversation.
step-audio-r1.1 does not support tool call. If the user needs tool calling,
prefer step-audio-2 instead of this skill.
Text in, audio out:
python3 {baseDir}/scripts/stepfun_audio_chat.py \
--prompt "用中文介绍一下苏州的春天,语气自然一点。" \
--voice wenrounansheng \
--format wav
Check available voice ids before a run:
python3 {baseDir}/scripts/stepfun_audio_chat.py \
--list-voices
Text + local audio in, audio out:
python3 {baseDir}/scripts/stepfun_audio_chat.py \
--prompt "听完这段语音后,总结重点,并用更简洁的话复述。" \
--input-audio /path/to/input.wav \
--voice wenrounansheng \
--format wav
Build and inspect the non-streaming request without sending it:
python3 {baseDir}/scripts/stepfun_audio_chat.py \
--prompt "测试 step-audio-r1.1 非流式 payload" \
--dry-run \
--print-json
The helper writes a fresh output directory for each run unless --output-dir
is provided. Typical files are:
request.json: saved only for --dry-runresponse.json: full non-streaming response objectresponse.<format>: decoded audio from choices[0].message.audio.datatranscript.txt: choices[0].message.audio.transcriptcontent.txt: textual assistant content when presentpython3 {baseDir}/scripts/stepfun_audio_chat.py --help
Important flags:
--prompt: user text to send with the request--input-audio: local audio file that will be base64-encoded into
input_audio; non-WAV files are converted to WAV first when ffmpeg or
afconvert is available--system: optional system instruction--voice: output voice name--list-voices: query StepFun for account-level custom/cloned voices and
print a few official voice hints--format: non-streaming output audio format; this skill uses wav--no-audio-output: request text-only output while still using the Chat API--temperature: optional sampling override--max-tokens: optional generation cap--print-json: echo request or response JSON to stdout--dry-run: build payload and stop before the network callSet STEPFUN_API_KEY in the environment, or inject it through OpenClaw skill
config:
{
skills: {
entries: {
"stepfun-step-audio-r1-1": {
env: {
STEPFUN_API_KEY: "STEP_KEY_HERE",
},
},
},
},
}
The script still accepts STEP_API_KEY as a legacy alias for backward
compatibility, but the official name is STEPFUN_API_KEY.
Optional environment variables:
STEP_API_BASE_URL: overrides the default https://api.stepfun.comInput audio note:
input_audio.data in data:audio/wav;base64,... formatm4a, mp3, aiff, or similar, this script will try to convert
to WAV via ffmpeg or macOS afconvertinput_audio payload must stay within StepFun's 10MB base64
limitVoice selection note:
step-audio-r1.1 needs audio.voice whenever you request audio outputwenrounansheng, which was validated in real smoke
tests for this skill--voice explicitly--list-voices to inspect account-level custom/cloned voice idsstep-audio-r1.1, step-audio-2, and step-tts-* differ in voice usagestep-audio-r1.1 through Chat API.stream: false.Read references/stepfun-chat-api.md when you need the exact request shape, supported audio fields, or the non-streaming response layout. Read references/stepfun-voices.md when you need voice-selection guidance.