Install
openclaw skills install ollama-herd
Ollama multimodal model router for Llama, Qwen, DeepSeek, Phi, and Mistral, plus mflux image generation, speech-to-text, and embeddings. Self-hosted Ollama local AI (macOS, Linux, Windows) with 7-signal scoring, Ollama queue management, a real-time dashboard, and Ollama health monitoring. Routes Ollama LLM, image, STT, and embedding requests across macOS, Linux, and Windows devices. Ollama local inference routing | Ollama local AI router. Use when the user asks about their Ollama fleet, Ollama inference routing, Ollama node status, or Ollama fleet performance.
You are managing an Ollama Herd fleet, a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.
pip install ollama-herd # install Ollama Herd from PyPI
herd # start the Ollama router
herd-node # start an Ollama node agent (run on each device)
PyPI: ollama-herd | Source: github.com/geeks-accelerator/ollama-herd
The Ollama Herd router runs at http://localhost:11435 by default. If the user has specified a different Ollama URL, use that instead.
Use curl to interact with the Ollama fleet:
# ollama_fleet_status — check Ollama node health
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
Returns:
fleet.nodes_total / fleet.nodes_online: how many Ollama devices are in the fleet
fleet.models_loaded: total Ollama models currently loaded across all nodes
fleet.requests_active: total in-flight Ollama requests
nodes[]: per-node details (Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths)
queues: queue depths per Ollama node:model pair (pending, in-flight, done, failed)
# ollama_model_list: all Ollama models on all nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool
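The /fleet/status counters (fleet.nodes_total, fleet.nodes_online, fleet.models_loaded, fleet.requests_active) lend themselves to a quick scripted check. The following is a minimal Python sketch, not part of Ollama Herd itself. It assumes the dotted names map to a nested "fleet" object in the JSON body; the function names are hypothetical.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default router address from this document


def fetch_fleet_status(base_url: str = ROUTER_URL) -> dict:
    """GET /fleet/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/fleet/status", timeout=10) as resp:
        return json.load(resp)


def summarize_fleet(status: dict) -> str:
    """Build a one-line summary from the documented fleet.* counters.

    Assumption: the counters live under a top-level "fleet" key.
    Missing counters are treated as 0.
    """
    fleet = status.get("fleet", {})
    online = fleet.get("nodes_online", 0)
    total = fleet.get("nodes_total", 0)
    loaded = fleet.get("models_loaded", 0)
    active = fleet.get("requests_active", 0)
    state = "OK" if online > 0 else "NO NODES ONLINE"
    return (
        f"{state}: {online}/{total} nodes online, "
        f"{loaded} models loaded, {active} requests in flight"
    )
```

With a router running, `print(summarize_fleet(fetch_fleet_status()))` would print the summary; the parsing step is kept separate so it can be exercised without a live fleet.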
# ollama_pull_model — pull a model (auto-selects best node, streams progress)
curl -N http://localhost:11435/api/pull -d '{"name": "codestral"}'
# pull to a specific node
curl -N http://localhost:11435/api/pull -d '{"name": "llama3.3:70b", "node_id": "mac-studio"}'
# non-streaming (blocks until complete)
curl http://localhost:11435/api/pull -d '{"name": "phi4", "stream": false}'
# ollama_loaded_models — hot Ollama models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool
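Models returned by /api/ps are loaded in memory (hot), while models that appear only in /api/tags are on disk but not loaded (cold). A small Python sketch of that comparison follows. It assumes both endpoints return the standard Ollama shape, a "models" array whose entries carry a "name" field; the router's actual payload may include extra fields, and the helper names are hypothetical.

```python
def model_names(payload: dict) -> set[str]:
    """Collect model names, assuming the Ollama-style {"models": [{"name": ...}]} shape."""
    return {m["name"] for m in payload.get("models", []) if "name" in m}


def split_hot_cold(tags: dict, ps: dict) -> tuple[list[str], list[str]]:
    """Return (hot, cold) model names.

    hot  = models listed by /api/ps (loaded in memory)
    cold = models listed by /api/tags but absent from /api/ps (on disk only)
    """
    available = model_names(tags)
    hot = model_names(ps)
    cold = available - hot
    return sorted(hot), sorted(cold)
```

The two arguments are the decoded JSON bodies of /api/tags and /api/ps, for example as loaded with `json.load`.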
curl -s http://localhost:11435/v1/models | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
# ollama_traces — recent Ollama routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool
Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMA_NUM_PARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.
# View current Ollama config and node versions
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
# Toggle Ollama runtime settings (auto_pull, vram_fallback)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H "Content-Type: application/json" \
-d '{"auto_pull": false}'
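The same settings toggle can be issued from Python with only the standard library. This is a sketch mirroring the curl call above; the function names are hypothetical, and decoding the response as JSON is an assumption about what the endpoint returns.

```python
import json
import urllib.request


def build_settings_request(
    changes: dict, base_url: str = "http://localhost:11435"
) -> urllib.request.Request:
    """Build (but do not send) a POST to /dashboard/api/settings."""
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/dashboard/api/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_settings(changes: dict) -> dict:
    """Send the request. Assumes the router answers with a JSON body."""
    with urllib.request.urlopen(build_settings_request(changes), timeout=10) as resp:
        return json.load(resp)
```

For example, `apply_settings({"auto_pull": False})` corresponds to the curl command above.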
# View per-node Ollama model details with sizes and usage
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
# Pull an Ollama model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H "Content-Type: application/json" \
-d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'
# Delete an Ollama model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H "Content-Type: application/json" \
-d '{"model": "old-model:7b", "node_id": "mac-studio"}'
curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
The Ollama web dashboard is at http://localhost:11435/dashboard. It has eight tabs.
Direct the user to open this URL in their browser for visual Ollama monitoring.
Context protection strips num_ctx from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model
Check /fleet/status and verify nodes_online > 0
Check /dashboard/api/health for automated Ollama health checks with severity levels
Check /fleet/status and inspect each Ollama node's ollama.models_loaded and ollama.models_available
Check /api/tags for a flat list of all available Ollama models and which nodes have them
Check /api/ps: Ollama models listed here are currently loaded in memory (hot)
Models in /api/tags but not in /api/ps are on disk but not loaded (cold)
Check /dashboard/api/traces?limit=10 to see the last 10 Ollama requests
Check /dashboard/api/traces for high-latency Ollama entries
Check /fleet/status for Ollama nodes with high queue depths or memory pressure
Check whether num_ctx is being sent; Ollama context protection logs show whether requests triggered reloads
# Recent Ollama failures
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"
# Slowest Ollama requests
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, latency_ms/1000.0 as secs FROM request_traces WHERE status='completed' ORDER BY latency_ms DESC LIMIT 10"
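The two queries above can also be run from Python with the standard sqlite3 module. This sketch reuses the table and column names that appear in those queries (request_traces, request_id, model, node_id, status, error_message, latency_ms, timestamp); the function names are hypothetical, and nothing beyond those columns is assumed about the schema.

```python
import sqlite3
from pathlib import Path

# Database location used by the sqlite3 commands in this document.
DB_PATH = Path.home() / ".fleet-manager" / "latency.db"


def recent_failures(conn: sqlite3.Connection, limit: int = 10) -> list[tuple]:
    """Most recent failed requests: (request_id, model, status, error_message)."""
    cur = conn.execute(
        "SELECT request_id, model, status, error_message "
        "FROM request_traces WHERE status = 'failed' "
        "ORDER BY timestamp DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def slowest_requests(conn: sqlite3.Connection, limit: int = 10) -> list[tuple]:
    """Slowest completed requests: (model, node_id, latency in seconds)."""
    cur = conn.execute(
        "SELECT model, node_id, latency_ms / 1000.0 AS secs "
        "FROM request_traces WHERE status = 'completed' "
        "ORDER BY latency_ms DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

Opening the file read-only, for instance with `sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)`, avoids accidental writes to the router's database while inspecting it.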
# Ollama via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'
# Ollama native format
curl -s http://localhost:11435/api/chat \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'
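For scripted use, the two chat calls above can be wrapped in a small Python helper. This is a sketch under stated assumptions: it presumes the router passes through the usual response shapes (choices[0].message.content for the OpenAI format, message.content for the Ollama native format), and the function names are hypothetical. It sends a Content-Type header on both endpoints, which the native curl example omits.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "llama3.3:70b",
    base_url: str = "http://localhost:11435",
    openai_format: bool = True,
) -> urllib.request.Request:
    """Build a non-streaming chat request for either endpoint."""
    path = "/v1/chat/completions" if openai_format else "/api/chat"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of either response shape (assumed, see above)."""
    if "choices" in response:  # OpenAI-style body
        return response["choices"][0]["message"]["content"]
    return response["message"]["content"]  # Ollama native body


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=300) as resp:
        return extract_reply(json.load(resp))
```

Calling `chat("Hello from Ollama")` uses the OpenAI-format endpoint; passing `openai_format=False` targets /api/chat instead.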
Data directory: ~/.fleet-manager/ (contains Ollama latency data, traces, and logs).
Run herd to start the router.
Ask the user to run herd-node on their devices.
Use --router-url http://router-ip:11435 for explicit connection.
num_ctx: Ollama context protection should strip it.
Logs: ~/.fleet-manager/logs/herd.jsonl.