Qwen Qwen3

Qwen Qwen3 — run Qwen3.5, Qwen3, Qwen3-Coder, Qwen2.5-Coder, and Qwen3-ASR across your local fleet. LLM inference, code generation, and speech-to-text from Alibaba's Qwen family via Ollama Herd. Cross-platform (macOS, Linux, Windows). Zero cloud costs.

Install

openclaw skills install qwen-qwen3

Qwen — Run Qwen Models Across Your Local Fleet

Run Qwen3.5, Qwen3, Qwen3-Coder, Qwen2.5-Coder, and Qwen3-ASR on your own hardware. The fleet router picks the best device for each request and serves chat, code generation, and speech-to-text from one endpoint.

Supported Qwen models

LLM (Chat & Reasoning)

| Model | Parameters | Ollama name | Best for |
|---|---|---|---|
| Qwen3.5 | 0.8B–397B MoE | qwen3.5 | Latest; multimodal, best reasoning |
| Qwen3 | 0.6B–235B MoE | qwen3 | Competitive with GPT-4o |
| Qwen2.5 | 0.5B–72B | qwen2.5 | Proven, stable, multilingual |

Code Generation

| Model | Parameters | Ollama name | Best for |
|---|---|---|---|
| Qwen3-Coder | 30B MoE (3.3B active) | qwen3-coder | Agentic coding workflows |
| Qwen2.5-Coder | 0.5B–32B | qwen2.5-coder | Code generation (matches GPT-4o at 32B) |

Speech-to-Text

| Model | Parameters | Tool | Best for |
|---|---|---|---|
| Qwen3-ASR | 0.6B–1.7B | mlx-qwen3-asr | State-of-the-art local transcription |

Setup

pip install ollama-herd
herd              # start the router (port 11435)
herd-node         # run on each machine

# Pull Qwen models
ollama pull qwen3.5:32b
ollama pull qwen3-coder

For speech-to-text:

uv tool install "mlx-qwen3-asr[serve]" --python 3.14
curl -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" -d '{"transcription": true}'

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Use Qwen through the fleet

OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")

# Qwen3.5 for general chat
response = client.chat.completions.create(
    model="qwen3.5:32b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Qwen3-Coder for code

response = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a FastAPI CRUD app with SQLAlchemy"}],
)
print(response.choices[0].message.content)

Qwen ASR for transcription

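curl http://localhost:11435/api/transcribe -F "audio=@meeting.wav"

import httpx

def transcribe(audio_path: str) -> str:
    """Upload an audio file to the fleet router and return the transcript text."""
    with open(audio_path, "rb") as f:
        resp = httpx.post(
            "http://localhost:11435/api/transcribe",
            files={"audio": (audio_path, f)},  # multipart field name matches the curl example
            timeout=300.0,  # long recordings can take minutes to transcribe
        )
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return resp.json()["text"]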

Ollama API

# Qwen3.5 chat
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3.5:32b",
  "messages": [{"role": "user", "content": "Explain transformers"}],
  "stream": false
}'

# Qwen2.5-Coder
curl http://localhost:11435/api/chat -d '{
  "model": "qwen2.5-coder:32b",
  "messages": [{"role": "user", "content": "Optimize this SQL query: ..."}],
  "stream": false
}'
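
The same chat request can be sent from Python with only the standard library. This is a sketch under two assumptions: the router forwards `/api/chat` unchanged, and the non-streaming reply has the Ollama shape `{"message": {"content": ...}}`. The helper names `build_chat_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

HERD_URL = "http://localhost:11435"  # router address used throughout this page


def build_chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body that the curl examples above send to /api/chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def chat(model: str, prompt: str, base_url: str = HERD_URL,
         timeout: float = 300.0) -> str:
    """POST a single-turn, non-streaming chat request and return the reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # Assumed Ollama-style response shape for stream=false.
    return reply["message"]["content"]
```

For example, `chat("qwen2.5-coder:32b", "Optimize this SQL query: ...")` mirrors the second curl call.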

Hardware recommendations

Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.

| Model | Min RAM | Recommended hardware |
|---|---|---|
| qwen3.5:0.8b | 2GB | Any Mac |
| qwen3.5:9b | 8GB | Mac Mini M4 (16GB) |
| qwen3.5:32b | 24GB | Mac Mini M4 Pro (48GB) |
| qwen3.5:122b-a10b | 64GB | Mac Studio M4 Max (128GB) |
| qwen3.5:397b-a17b | 256GB+ | Mac Studio M3 Ultra (512GB) |
| qwen3-coder | 24GB | Mac Mini M4 Pro (48GB) |
| qwen2.5-coder:32b | 24GB | Mac Mini M4 Pro (48GB) |
| Qwen3-ASR (0.6B) | 1.2GB | Any Mac |
| Qwen3-ASR (1.7B) | 3.4GB | Any Mac (8GB+) |
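
The minimum-RAM column can be turned into a simple lookup for choosing a model per machine. The sketch below copies the Qwen3.5 rows from the table and returns the largest variant whose minimum fits a given amount of RAM. `largest_fitting` is a hypothetical helper, not part of Ollama Herd, and real memory use also depends on context length and quantization, which the table does not specify.

```python
# (model tag, minimum RAM in GB), copied from the Qwen3.5 rows of the table above.
# "256GB+" is treated as 256.
MIN_RAM_GB = [
    ("qwen3.5:0.8b", 2),
    ("qwen3.5:9b", 8),
    ("qwen3.5:32b", 24),
    ("qwen3.5:122b-a10b", 64),
    ("qwen3.5:397b-a17b", 256),
]


def largest_fitting(ram_gb, table=MIN_RAM_GB):
    """Return the tag with the highest minimum RAM that still fits, or None."""
    fitting = [(min_ram, tag) for tag, min_ram in table if min_ram <= ram_gb]
    if not fitting:
        return None
    # max() compares the RAM requirement first, so the largest model wins.
    return max(fitting)[1]


if __name__ == "__main__":
    for ram in (1, 16, 48, 128, 512):
        print(f"{ram} GB -> {largest_fitting(ram)}")
```

Returning `None` when nothing fits leaves the caller free to suggest a smaller model rather than attempting a pull that cannot run.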

Why run Qwen locally

  • Zero cost: no per-token API charges
  • Privacy: Chinese and English content stays on your devices
  • Full Qwen family: chat, code, reasoning, and speech-to-text from one fleet
  • No rate limits: Alibaba Cloud throttles API access; local inference is limited only by your hardware
  • Fleet routing: multiple machines share the load, and the router picks the fastest available device

The Qwen advantage on this fleet

Qwen models are well suited to fleet routing:

  • MoE architecture: Qwen3.5 (397B total, 17B active) and Qwen3-Coder (30B total, 3.3B active) use Mixture of Experts. Only a fraction of the parameters is active for each token, so generation is fast relative to total size, although the full weights must still fit in memory.
  • Size variety: from 0.5B to 397B, there is a Qwen model for every device in your fleet. Small machines run the small models, large machines run the large ones.
  • Code + Chat + STT: Qwen covers three capabilities from one vendor on one fleet.
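
To make the MoE point concrete, the active share of parameters can be computed from the totals quoted above. This is plain arithmetic on the figures in this section. The `qwen3.5:397b-a17b` tag is taken from the hardware table, and `active_fraction` is a hypothetical helper.

```python
# (total parameters, active parameters) in billions, as quoted in this section.
MOE_MODELS = {
    "qwen3.5:397b-a17b": (397.0, 17.0),
    "qwen3-coder": (30.0, 3.3),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that is active for each token."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total, active) in MOE_MODELS.items():
        share = active_fraction(total, active)
        print(f"{name}: {share:.1%} of parameters active per token")
```

Per-token compute scales roughly with the active count, while memory requirements follow the total count, which is why the hardware table still lists 256GB+ for the 397B model.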

Also available on this fleet

Other LLM models

Llama 3.3, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3 — any Ollama model routes through the same endpoint.

Image generation

curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset","width":1024,"height":1024,"steps":4}'

Embeddings

curl http://localhost:11435/api/embeddings -d '{"model":"nomic-embed-text","prompt":"query"}'
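
A common next step is comparing two embeddings. The sketch below assumes the router passes through the Ollama `/api/embeddings` response, a JSON object with an `embedding` list. `embed` and `cosine` are hypothetical helper names. Only `cosine` can be tried without a running router.

```python
import json
import math
import urllib.request


def embed(text, model="nomic-embed-text", base_url="http://localhost:11435"):
    """Request one embedding vector (assumes an Ollama-style response body)."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For example, `cosine(embed("query"), embed("search term"))` gives a score in [-1, 1], where higher means more similar.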

Dashboard

http://localhost:11435/dashboard — monitor Qwen requests alongside all other models. Per-model latency, token throughput, error rates, health checks.

Full documentation

Agent Setup Guide

Guardrails

  • Never pull or delete Qwen models without user confirmation.
  • Never delete or modify files in ~/.fleet-manager/.
  • If a Qwen model is too large for available memory, suggest a smaller variant or MoE version.
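
One way an agent could honor the first guardrail is to gate every pull behind an explicit prompt. This is a sketch, not the skill's actual implementation: `pull_with_confirmation` is a hypothetical helper, and its `ask` and `run` parameters exist only so the prompt and the `ollama pull` call can be replaced in tests.

```python
import subprocess


def pull_with_confirmation(model, ask=input, run=subprocess.run):
    """Run `ollama pull <model>` only after the user explicitly answers yes."""
    answer = ask(f"Pull '{model}'? This may download many GB. [y/N] ")
    if answer.strip().lower() not in {"y", "yes"}:
        # Anything other than an explicit yes counts as a refusal.
        return False
    run(["ollama", "pull", model], check=True)
    return True
```

Defaulting to "no" on empty or unexpected input keeps an accidental Enter key from starting a large download.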