Install
openclaw skills install ollama-fleet-router
Ollama fleet router: herd your Ollama LLMs into one smart endpoint. Route Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices with 7-signal scoring, auto-retry, VRAM-aware fallback, and context protection. It also handles image generation, speech-to-text, and embeddings, and is drop-in compatible with the OpenAI SDK.
You have Ollama running on multiple machines. This skill gives you one endpoint that routes every request to the best available device automatically. No more hardcoding IPs, no more manual load balancing, no more "which machine has that model loaded?"
pip install ollama-herd
herd # start the router on port 11435
herd-node # run on each machine with Ollama
Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same models, smarter routing.
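For scripts that already hold Ollama URLs, the switch can be sketched as a simple port rewrite. This is a minimal illustration using only the standard library; `to_router_url` is a hypothetical helper name, not something ollama-herd ships, and it assumes the router runs on the same host as the Ollama instance the URL pointed at.

```python
from urllib.parse import urlsplit, urlunsplit

OLLAMA_PORT = 11434  # default port of a single Ollama instance
ROUTER_PORT = 11435  # port the herd router listens on (from the text above)


def to_router_url(url: str) -> str:
    """Rewrite an Ollama URL so it targets the router on the same host."""
    parts = urlsplit(url)
    if parts.port != OLLAMA_PORT:
        return url  # not an Ollama URL, leave it untouched
    netloc = f"{parts.hostname}:{ROUTER_PORT}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(to_router_url("http://localhost:11434/api/chat"))
```

Paths and query strings pass through unchanged, which matches the claim that the router exposes the same Ollama API.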
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
# Chat
curl http://localhost:11435/api/chat -d '{
  "model": "qwen3:235b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'
# List all models across all machines
curl http://localhost:11435/api/tags
# Models currently in GPU memory
curl http://localhost:11435/api/ps
# Embeddings
curl http://localhost:11435/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "search query"
}'
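The curl calls above map directly onto Python's standard library. The sketch below is one way to wrap them; `build_request` and `call` are illustrative helper names rather than part of ollama-herd, and it assumes the router answers these endpoints with a single JSON document (for chat, that means sending `"stream": false`).

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:11435"  # router address from the text above


def build_request(path: str, payload: Optional[dict] = None) -> Request:
    """Build a GET request (no payload) or a JSON POST request."""
    if payload is None:
        return Request(BASE_URL + path)
    body = json.dumps(payload).encode("utf-8")
    return Request(BASE_URL + path, data=body,
                   headers={"Content-Type": "application/json"})


def call(path: str, payload: Optional[dict] = None, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON reply (needs a running router)."""
    with urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.load(resp)


# Same payloads as the curl examples; nothing is sent until call() is used.
chat_request = build_request("/api/chat", {
    "model": "qwen3:235b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
})
embed_request = build_request("/api/embeddings", {
    "model": "nomic-embed-text",
    "prompt": "search query",
})
```

With a router running, `call("/api/tags")` and `call("/api/ps")` would fetch the model listings shown above.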
When a request comes in, the router scores every online node on 7 signals.
The highest-scoring node handles the request. If it fails, the router automatically retries on the next-best node.
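The score-then-retry behavior can be sketched as follows. This is not the router's actual implementation: the signal names, weights, and the `Node`, `score`, and `route` helpers are all hypothetical, since the seven real signals are not listed here. The sketch uses three placeholder signals to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signals and weights; the router's real seven signals
# and their weighting are not given in this document.
WEIGHTS: Dict[str, float] = {
    "model_loaded": 3.0,   # model already in memory on this node
    "free_memory": 2.0,    # fraction of memory still available
    "queue_depth": -1.0,   # pending requests count against the node
}


@dataclass
class Node:
    name: str
    signals: Dict[str, float]


def score(node: Node) -> float:
    """Weighted sum of the node's signals; missing signals count as 0."""
    return sum(w * node.signals.get(k, 0.0) for k, w in WEIGHTS.items())


def route(nodes: List[Node], send: Callable[[Node], str]) -> str:
    """Try nodes from highest to lowest score until one succeeds."""
    last_error = None
    for node in sorted(nodes, key=score, reverse=True):
        try:
            return send(node)
        except ConnectionError as err:
            last_error = err  # fall through to the next-best node
    raise RuntimeError("all nodes failed") from last_error


nodes = [
    Node("a", {"model_loaded": 1, "free_memory": 0.2, "queue_depth": 3}),
    Node("b", {"model_loaded": 0, "free_memory": 0.9, "queue_depth": 0}),
    Node("c", {"model_loaded": 1, "free_memory": 0.5, "queue_depth": 0}),
]


def flaky_send(node: Node) -> str:
    """Stand-in for a real request: node "c" is pretended to be down."""
    if node.name == "c":
        raise ConnectionError("node c is unreachable")
    return f"handled by {node.name}"


print(route(nodes, flaky_send))
```

Node `c` scores highest but fails, so the request falls through to `b`, the next-best node.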
Any model that runs on Ollama works through the fleet. Popular ones:
| Model | Sizes | Best for |
|---|---|---|
| llama3.3 | 8B, 70B | General purpose |
| qwen3 | 0.6B–235B | Multilingual, reasoning |
| qwen3.5 | 0.8B–397B | Latest generation |
| deepseek-v3 | 671B (37B active) | Matches GPT-4o |
| deepseek-r1 | 1.5B–671B | Reasoning (like o3) |
| phi4 | 14B | Small, fast, capable |
| mistral | 7B | Fast, European languages |
| gemma3 | 1B–27B | Google's open model |
| codestral | 22B | Code generation |
| qwen3-coder | 30B (3.3B active) | Agentic coding |
| nomic-embed-text | 137M | Embeddings for RAG |
Context protection keeps num_ctx from triggering expensive model reloads.
The same fleet router handles three more workloads:
curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset","width":1024,"height":1024,"steps":4}'
Enable: curl -X POST .../dashboard/api/settings -d '{"image_generation":true}'
curl http://localhost:11435/api/transcribe -F "audio=@recording.wav"
Enable: curl -X POST .../dashboard/api/settings -d '{"transcription":true}'
curl http://localhost:11435/api/embeddings -d '{"model":"nomic-embed-text","prompt":"text"}'
Already enabled — routes through Ollama automatically.
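The image generation call above can also be made from Python with the standard library. This is a minimal sketch, assuming the endpoint replies with raw image bytes (as the `curl -o image.png` example suggests) and that image generation has been enabled; `build_image_request` and `save_image` are illustrative names, not part of ollama-herd.

```python
import json
from urllib.request import Request, urlopen

# Endpoint taken from the curl example above
IMAGE_URL = "http://localhost:11435/api/generate-image"


def build_image_request(prompt: str, model: str = "z-image-turbo",
                        width: int = 1024, height: int = 1024,
                        steps: int = 4) -> Request:
    """Build the JSON POST request; field names mirror the curl example."""
    payload = {"model": model, "prompt": prompt,
               "width": width, "height": height, "steps": steps}
    return Request(IMAGE_URL, data=json.dumps(payload).encode("utf-8"),
                   headers={"Content-Type": "application/json"})


def save_image(prompt: str, path: str = "image.png", timeout: float = 300.0) -> int:
    """Send the request and write the reply to disk.

    Assumes the response body is the raw image. Returns bytes written.
    """
    with urlopen(build_image_request(prompt), timeout=timeout) as resp:
        data = resp.read()
    with open(path, "wb") as fh:
        return fh.write(data)
```

With the router running, `save_image("a sunset")` would write `image.png` to the current directory.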
http://localhost:11435/dashboard — 8 tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. Real-time queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.
Track per-project usage:
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=messages,
    extra_body={"metadata": {"tags": ["my-project", "reasoning"]}},
)
~/.fleet-manager/.