Install
openclaw skills install local-llm-router
Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted inference routing on macOS, Linux, and Windows. A 7-signal scoring engine picks the optimal machine for every request. OpenAI-compatible API with context protection, VRAM-aware fallback, and auto-retry. Use when the user wants to optimize local LLM routing, reduce latency, or load balance across machines.
You are managing a local LLM inference router that distributes requests across multiple Ollama instances using a 7-signal scoring engine.
You have multiple machines with GPUs, but your inference scripts only talk to one. Switching models between machines means editing configs and restarting. There is no way to compare latency across nodes, no automatic failover, and no visibility into which machine handles which requests.
This router sits in front of your Ollama instances and picks the optimal device for every request, based on which models are hot in memory, how much headroom each machine has, how deep the queues are, and historical latency data. It is drop-in compatible with the OpenAI SDK and the Ollama API.
pip install ollama-herd # install the router
herd # launch the router (scores and routes requests)
herd-node # launch the node agent on each device
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
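For a multi-machine fleet, run the router on one host and the node agent on every GPU device; the --router-url flag (see the notes at the end) tells each agent where to register. A minimal sketch, assuming the router host is reachable at router-ip:
# on the router machine — starts scoring and routing on :11435
herd
# on each GPU device — registers the node with the router
herd-node --router-url http://router-ip:11435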
The router runs at http://localhost:11435 by default. Point any OpenAI-compatible client at http://localhost:11435/v1.
# local_llm_client — connect to the router via the OpenAI SDK
from openai import OpenAI

local_llm_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
local_llm_stream = local_llm_client.chat.completions.create(
    model="llama3.3:70b",  # routed to whichever node scores best for this model
    messages=[{"role": "user", "content": "Hello from local LLM"}],
    stream=True,
)
for chunk in local_llm_stream:  # print tokens as they arrive
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Every request is scored across 7 signals, weighing factors like which models are hot in memory, VRAM headroom, queue depth, and historical latency; the sketch below illustrates the idea.
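The full signal list and weights live in the ollama-herd source; as a rough illustration of the approach only (the signal names and weights here are invented, not the real ones), a weighted scorer combines normalized per-node signals into one number:
# Illustrative sketch only — not the actual ollama-herd scoring code.
SIGNAL_WEIGHTS = {
    "model_hot": 0.30,           # requested model already in GPU memory
    "vram_headroom": 0.20,       # free VRAM relative to model size
    "queue_depth": 0.20,         # fewer queued requests scores higher
    "historical_latency": 0.15,  # past latency for this model on this node
    # ...remaining signals omitted; the real router uses 7 in total
}

def score_node(signals: dict[str, float]) -> float:
    """Combine normalized 0..1 signals into a single routing score."""
    return sum(weight * signals.get(name, 0.0) for name, weight in SIGNAL_WEIGHTS.items())

# The router then sends the request to the node whose signals score highest.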
When clients send num_ctx in a request, the router intercepts it to prevent Ollama from reloading models unnecessarily:
- num_ctx <= loaded context: stripped (the loaded model already supports it)
- num_ctx > loaded context: auto-upgrades to a larger loaded model with sufficient context
- Mode is set via FLEET_CONTEXT_PROTECTION (strip/warn/passthrough); see the sketch after this list.
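A minimal sketch of that interception logic, assuming hypothetical function names (pick_model_with_context is not part of the package; the real behavior is governed by FLEET_CONTEXT_PROTECTION):
# Illustrative sketch of num_ctx interception — not the actual ollama-herd code.
def protect_context(request: dict, loaded_ctx: int, mode: str = "strip") -> dict:
    num_ctx = request.get("options", {}).get("num_ctx")
    if num_ctx is None or mode == "passthrough":
        return request
    if num_ctx <= loaded_ctx:
        # The loaded model already covers the requested context: drop num_ctx
        # so Ollama does not reload the model with new parameters.
        request["options"].pop("num_ctx")
    else:
        # Requested context exceeds what is loaded: upgrade to a loaded model
        # variant with a large enough context window.
        request["model"] = pick_model_with_context(request["model"], num_ctx)  # hypothetical helper
    return request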
# local_llm_fleet_status — all nodes and queues
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# local_llm_model_list — every local LLM model on every node
curl -s http://localhost:11435/api/tags | python3 -m json.tool
# local_llm_hot_models — local LLM models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool
# local_llm_openai_models — models via the OpenAI-compatible endpoint
curl -s http://localhost:11435/v1/models | python3 -m json.tool
# local_llm_traces — recent local LLM routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool
Returns: model requested, node selected, score breakdown, latency, tokens, and retry/fallback status.
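To inspect routing decisions from a script, you can hit the same endpoint with plain Python. A minimal sketch; the field names below follow the list above but should be treated as assumptions until you inspect the actual JSON payload:
# local_llm_trace_summary — print recent routing decisions
import json, urllib.request

with urllib.request.urlopen("http://localhost:11435/dashboard/api/traces?limit=20") as resp:
    data = json.load(resp)

traces = data if isinstance(data, list) else data.get("traces", [])  # payload shape may vary
for t in traces:
    print(t.get("model"), "->", t.get("node_id"), f"{t.get('latency_ms', 0) / 1000:.1f}s")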
# local_llm_model_stats — per-model dashboard stats
curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool
# local_llm_usage — aggregate usage data
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
# local_llm_health — fleet health summary
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
# local_llm_recommendations — router recommendations
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
# local_llm_settings — current router settings
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
# Toggle local LLM auto-pull
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H "Content-Type: application/json" \
-d '{"auto_pull": false}'
# local_llm_model_inventory — per-node local LLM model details
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
# Pull a local LLM model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H "Content-Type: application/json" \
-d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'
# Delete a local LLM model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H "Content-Type: application/json" \
-d '{"model": "old-model:7b", "node_id": "mac-studio"}'
# local_llm_apps — per-app usage (matches the dashboard Apps tab)
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
Web dashboard at http://localhost:11435/dashboard with eight tabs: Local LLM Fleet Overview, Trends, Local LLM Model Insights, Apps, Benchmarks, Local LLM Health, Recommendations, Settings.
# local_llm_slowest — slowest (model, node) pairs by average latency
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, AVG(latency_ms)/1000.0 AS avg_secs, COUNT(*) AS n FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 10 ORDER BY avg_secs DESC LIMIT 10"
# local_llm_ttft — average time-to-first-token per node and model
sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, model, AVG(time_to_first_token_ms) AS avg_ttft FROM request_traces WHERE time_to_first_token_ms IS NOT NULL GROUP BY node_id, model"
# local_llm_hot_vs_cold — latency with the model hot in memory vs cold-loaded
sqlite3 ~/.fleet-manager/latency.db "SELECT model, CASE WHEN time_to_first_token_ms < 1000 THEN 'hot' ELSE 'cold' END AS load_type, AVG(latency_ms)/1000.0 AS avg_secs, COUNT(*) AS n FROM request_traces WHERE status='completed' AND time_to_first_token_ms IS NOT NULL GROUP BY model, load_type ORDER BY model"
# local LLM via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'
# local LLM via Ollama format
curl -s http://localhost:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'
Notes:
- The router intercepts num_ctx values and auto-upgrades to larger models when needed.
- State lives in ~/.fleet-manager/ (contains latency data, traces, and logs).
- Start the router with herd or uv run herd; start herd-node on devices with --router-url http://router-ip:11435.
- If routing misbehaves, check num_ctx in client requests and verify context protection.
- Logs: ~/.fleet-manager/logs/herd.jsonl
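The log file is JSONL; one quick way to inspect recent entries (json.tool's --json-lines flag requires Python 3.8+):
# pretty-print the last 20 router log entries
tail -n 20 ~/.fleet-manager/logs/herd.jsonl | python3 -m json.tool --json-lines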