Install
openclaw skills install ollama-load-balancer
Ollama load balancer for Llama, Qwen, DeepSeek, and Mistral inference across multiple machines. Load balancing with auto-discovery via mDNS, health checks, queue management, automatic failover, retry on node failure, and zombie request cleanup. Zero configuration. Load-balanced distribution of Ollama inference. Ollama load balancer for distributed inference.
You are managing an Ollama load balancer that distributes inference requests across multiple Ollama instances with automatic discovery, health monitoring, and failover. The load balancer handles all routing decisions transparently.
Ollama has no built-in load balancing. One machine goes down, your app gets errors. No health checks, no failover, no queue management. You're manually pointing clients at specific machines and hoping they stay up.
This load balancer auto-discovers Ollama instances via mDNS, monitors their health continuously, and distributes load based on real-time scoring. When a node fails, the load balancer retries the request automatically. No config files and no Docker are needed: run pip install ollama-herd, start herd and herd-node, and load balancing is active.
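The score-then-retry behaviour described above can be pictured with a small sketch. This is not ollama-herd's actual algorithm: the `Node` fields, the free-slot scoring rule, and the `route_with_retry` helper are all hypothetical, chosen only to illustrate routing to the best-scoring healthy node and falling back to the next one on failure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Node:
    # Hypothetical node record; these field names are illustrative only.
    name: str
    healthy: bool
    in_flight: int
    max_concurrent: int


def score(node: Node) -> float:
    """Assumed scoring rule: fraction of free slots; unusable nodes score -1."""
    if not node.healthy or node.max_concurrent <= 0:
        return -1.0
    return 1.0 - node.in_flight / node.max_concurrent


def route_with_retry(
    nodes: Iterable[Node],
    send: Callable[[Node], str],
    max_retries: int = 2,
) -> str:
    """Try the best-scoring nodes in order, moving to the next one on failure."""
    candidates: List[Node] = sorted(
        (n for n in nodes if score(n) >= 0), key=score, reverse=True
    )
    last_error = None
    # One initial attempt plus up to `max_retries` further attempts.
    for node in candidates[: max_retries + 1]:
        try:
            return send(node)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("no node could serve the request") from last_error
```

Here `send` stands in for whatever actually forwards the request to a node; a real balancer would also weigh memory pressure and loaded models, which this sketch ignores.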
pip install ollama-herd
herd # start the load balancer on port 11435
herd-node # start load balancer backend node on each machine
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
The load balancer runs at http://localhost:11435. Drop-in replacement for direct Ollama connections — same API, same model names, with load balancing built in.
from openai import OpenAI
# Load balancer client — requests are balanced across all backend nodes
load_balancer_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
load_balanced_response = load_balancer_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain load balancing for LLM inference"}],
)
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
The load balancer checks for offline nodes, degraded nodes, memory pressure, underutilized nodes, model thrashing, request timeouts, and error rates. Each check returns a severity (info, warning, or critical) and recommendations.
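Once the health output is parsed, it is often useful to surface only the urgent findings. The sketch below assumes each check is a dict with a `severity` key holding one of the three levels named above; the exact shape of the health payload is not documented here, so the `name` field and the sample data are hypothetical.

```python
# Severity levels named in the text, ordered from least to most urgent.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def rank(check):
    """Rank a check by severity; unknown or missing severities count as info."""
    return SEVERITY_RANK.get(check.get("severity"), 0)


def checks_at_or_above(checks, minimum="warning"):
    """Return the checks at or above `minimum`, most severe first."""
    threshold = SEVERITY_RANK[minimum]
    return sorted((c for c in checks if rank(c) >= threshold), key=rank, reverse=True)


# Hypothetical sample data; the real response layout is an assumption.
sample_checks = [
    {"name": "underutilized nodes", "severity": "info"},
    {"name": "offline nodes", "severity": "critical"},
    {"name": "memory pressure", "severity": "warning"},
]

for check in checks_at_or_above(sample_checks):
    print(f"{check['severity']}: {check['name']}")
```

If the real payload nests the checks under another key, extract that list first and pass it to `checks_at_or_above`.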
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
Returns per-node: status (online/degraded/offline), CPU utilization, memory usage, loaded models with context lengths, and load balancer queue depths (pending/in-flight/done/failed).
curl -s http://localhost:11435/fleet/status | python3 -c "
import sys, json
# Load balancer queue inspection
data = json.load(sys.stdin)
for key, q in data.get('queues', {}).items():
    print(f\"{key}: {q['pending']} pending, {q['in_flight']}/{q['max_concurrent']} in-flight\")
"
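The same queue fields can be used to flag queues that are backing up. This is a minimal sketch that relies only on the `queues`, `pending`, `in_flight`, and `max_concurrent` keys used in the one-liner above; the helper name and the sample queue keys are made up for illustration.

```python
def saturated_queues(fleet_status, min_pending=1):
    """Return the keys of queues that are at capacity and still have work waiting."""
    flagged = []
    for key, q in fleet_status.get("queues", {}).items():
        at_capacity = q["in_flight"] >= q["max_concurrent"]
        if at_capacity and q["pending"] >= min_pending:
            flagged.append(key)
    return sorted(flagged)


# Hypothetical sample; real queue keys may look different.
sample_status = {
    "queues": {
        "queue-a": {"pending": 5, "in_flight": 2, "max_concurrent": 2},
        "queue-b": {"pending": 0, "in_flight": 1, "max_concurrent": 4},
    }
}
print(saturated_queues(sample_status))
```

To run it against a live fleet, pass in the parsed JSON from the /fleet/status endpoint instead of the sample dict.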
FLEET_MAX_RETRIES)
num_ctx parameters that would trigger model reloads
# All models across the load-balanced fleet
curl -s http://localhost:11435/api/tags | python3 -m json.tool
# Models currently loaded in load balancer backend memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool
# OpenAI-compatible model list via load balancer
curl -s http://localhost:11435/v1/models | python3 -m json.tool
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
# View load balancer config
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
# Toggle load balancer features
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H "Content-Type: application/json" \
-d '{"auto_pull": false}'
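For scripts that already run in Python, the same settings toggle can be sent without shelling out to curl. The sketch below uses only the standard library and mirrors the URL, header, and JSON body of the curl call above; `build_settings_request` is a hypothetical helper name, and the response format is not assumed.

```python
import json
import urllib.request


def build_settings_request(changes, base_url="http://localhost:11435"):
    """Build a POST request for the settings endpoint without sending it."""
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/dashboard/api/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_settings_request({"auto_pull": False})
# To actually send it (requires a running load balancer):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect or log before anything changes on the fleet.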
# View per-node model details behind the load balancer
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
# Pull a model to a load balancer backend node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H "Content-Type: application/json" \
-d '{"model": "llama3.3:70b", "node_id": "load-balancer-node-1"}'
# Delete a model from a load balancer node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H "Content-Type: application/json" \
-d '{"model": "old-model:7b", "node_id": "load-balancer-node-1"}'
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
The web dashboard at http://localhost:11435/dashboard has eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. All load balancer data updates in real time via Server-Sent Events.
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message, latency_ms/1000.0 as secs FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"
sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, SUM(retry_count) as retries, COUNT(*) as total FROM request_traces GROUP BY node_id ORDER BY retries DESC"
sqlite3 ~/.fleet-manager/latency.db "SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour, COUNT(*) as requests FROM request_traces GROUP BY hour ORDER BY hour"
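The trace database can also be queried from Python. As a rough sketch, the function below computes a per-node failure rate using the `request_traces` table and the `node_id` and `status` columns that appear in the queries above; the function name is hypothetical, and it assumes that failed requests are the rows with `status='failed'`, as in the first query.

```python
import sqlite3


def failure_rate_by_node(conn):
    """Return {node_id: failed / total} computed from the request_traces table."""
    rows = conn.execute(
        """
        SELECT node_id,
               SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failed,
               COUNT(*) AS total
        FROM request_traces
        GROUP BY node_id
        ORDER BY node_id
        """
    ).fetchall()
    return {node_id: failed / total for node_id, failed, total in rows}


# Usage against the real database (path taken from the commands above):
# import os
# conn = sqlite3.connect(os.path.expanduser("~/.fleet-manager/latency.db"))
# print(failure_rate_by_node(conn))
```

A node with a high failure rate and a high retry count in the second query above is a reasonable first place to look when requests start failing.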
curl -s http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Test load balancing across nodes"}],"stream":false}'
curl -s http://localhost:11435/api/chat \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Verify load balancer routing"}],"stream":false}'
~/.fleet-manager/ (contains load balancer latency data, traces, and logs).
herd or uv run herd
herd-node on load balancer backend devices
--router-url http://router-ip:11435
num_ctx in client requests; verify with grep "Context protection" ~/.fleet-manager/logs/herd.jsonl
~/.fleet-manager/logs/herd.jsonl