Local LLM Router

v1.0.3

Local LLM model router for Llama, Qwen, DeepSeek, Phi, Mistral, and Gemma across multiple devices. Self-hosted local LLM inference routing on Mac Studio, Mac...

by Twin Geeks (@twinsgeeks)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name and description match the actions in SKILL.md: routing local LLM requests, scoring nodes, and exposing an OpenAI-compatible local endpoint. The declared binaries (curl/wget, optional python3/pip/sqlite3) and configPaths (~/.fleet-manager/latency.db, logs/herd.jsonl) are appropriate for a local fleet manager.
Instruction Scope
Instructions are largely scoped to installing the ollama-herd package, running herd/herd-node, and querying local HTTP endpoints. However, SKILL.md references the FLEET_CONTEXT_PROTECTION environment variable and other behavior (SQLite latency history) without declaring that variable or explicitly documenting reads and writes to the listed config paths — the agent instructions therefore touch configuration and state beyond what the metadata explicitly lists.
Install Mechanism
There is no registry-level install spec, but the runtime docs instruct the user to run 'pip install ollama-herd' (PyPI). Installing a third-party Python package from PyPI is expected for this functionality but carries standard supply-chain risk; the skill itself won't be written to disk by the registry since it's instruction-only.
Credentials
The skill declares no required credentials and none are requested in metadata, which matches expectations. Minor mismatch: SKILL.md references FLEET_CONTEXT_PROTECTION (strip/warn/passthrough), which is not listed under required env vars and should be declared or documented. The metadata's configPaths indicate the skill will read and write ~/.fleet-manager data, which is proportionate for a fleet manager but worth noting.
Persistence & Privilege
The skill is not always-enabled, does not request elevated platform privileges, and does not modify other skills. Runtime instructions tell the user to install a package and run local daemons (herd, herd-node) which is normal for this class of tool.
Assessment
This skill appears to do what it says (route local LLM requests across machines), but it asks you to install a third-party Python package (pip install ollama-herd) and will create/read files under ~/.fleet-manager. Before installing:

  • Inspect the ollama-herd project (PyPI page and the GitHub repo) and review its source code, release history, and maintainers.
  • Run the package in a controlled environment (virtualenv or isolated machine) first, so any daemons (herd, herd-node) cannot access sensitive systems.
  • Be aware it will persist data (latency.db, logs) under ~/.fleet-manager; if you need to keep those private, review or redirect the paths.
  • SKILL.md refers to an env var FLEET_CONTEXT_PROTECTION that is not declared in the registry metadata; if you rely on that behavior, set and test it explicitly.
  • If you will deploy herd-node on multiple devices, make sure you understand its network requirements and firewall rules so nodes only join the intended fleet.

Like a lobster shell, security has layers — review code before you run it.

Tags: apple-silicon · codestral · deepseek · gemma · inference-routing · latency · latest · llama · llm-routing · load-balancing · local-ai · local-llm · mac-mini · mac-studio · mistral · model-router · multi-node · ollama · phi · qwen · scoring · self-hosted


Runtime requirements

Router: Clawdis
OS: macOS · Linux
Bin (any of): curl, wget

SKILL.md

Local LLM Router

You are managing a local LLM inference router that distributes requests across multiple Ollama instances using a 7-signal scoring engine.

What this router solves

You have multiple machines with GPUs, but your inference scripts only talk to one. Switching models between machines means editing configs and restarting. There is no way to compare latency across nodes, no automatic failover, and no visibility into which machine handles which requests.

This router sits in front of your Ollama instances and picks the best device for every request — based on which models are hot in memory, how much headroom each machine has, how deep the queues are, and historical latency data. It is drop-in compatible with the OpenAI SDK and the Ollama API.

Setup

pip install ollama-herd           # install the router
herd                              # launch the router (scores and routes)
herd-node                         # launch a node agent on each device

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Router Endpoint

The router runs at http://localhost:11435 by default. Point any OpenAI-compatible client at http://localhost:11435/v1.

# Connect any OpenAI-compatible client to the router
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Hello from local LLM"}],
    stream=True,
)
for chunk in response:  # print tokens as they stream in
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Scoring Engine

Every request is scored across 7 signals:

  1. Thermal state (+50 pts) — models already loaded in GPU memory ("hot") score highest
  2. Memory fit (+20 pts) — nodes with more available headroom score higher
  3. Queue depth (-30 pts) — busy nodes are penalized
  4. Latency history (-25 pts) — past p75 latency from SQLite informs the expected wait
  5. Role affinity (+15 pts) — large models prefer big machines
  6. Availability trend (+10 pts) — nodes with stable availability patterns score higher
  7. Context fit (+15 pts) — nodes whose loaded context window fits the estimated token count
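The weighted signals above can be sketched as a simple additive score. This is an illustrative model only, not the package's actual implementation; all field names (hot_models, free_mem_ratio, queue_depth, and so on) and the scaling of the penalty terms are assumptions.

```python
# Illustrative sketch of a 7-signal additive node score.
# NOT ollama-herd's real code; field names and scaling are assumed.

def score_node(node: dict, request: dict) -> float:
    score = 0.0
    if request["model"] in node["hot_models"]:
        score += 50                                   # 1. thermal state: model is hot
    score += 20 * node["free_mem_ratio"]              # 2. memory fit: headroom 0..1
    score -= 30 * min(node["queue_depth"] / 5, 1)     # 3. queue depth penalty
    score -= 25 * min(node["p75_latency_s"] / 10, 1)  # 4. latency history penalty
    if request["large_model"] and node["big_machine"]:
        score += 15                                   # 5. role affinity
    score += 10 * node["availability"]                # 6. availability trend 0..1
    if node["loaded_ctx"] >= request["est_tokens"]:
        score += 15                                   # 7. context fit
    return score

nodes = [
    {"id": "mac-studio", "hot_models": ["llama3.3:70b"], "free_mem_ratio": 0.4,
     "queue_depth": 1, "p75_latency_s": 2.0, "big_machine": True,
     "availability": 0.99, "loaded_ctx": 8192},
    {"id": "mac-mini", "hot_models": [], "free_mem_ratio": 0.7,
     "queue_depth": 0, "p75_latency_s": 1.0, "big_machine": False,
     "availability": 0.95, "loaded_ctx": 4096},
]
request = {"model": "llama3.3:70b", "large_model": True, "est_tokens": 4000}
best = max(nodes, key=lambda n: score_node(n, request))
print(best["id"])  # mac-studio — the hot-model bonus dominates here
```

Note how the +50 thermal bonus outweighs the smaller machine's extra headroom, which matches the stated priority of keeping requests on nodes that already have the model loaded.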

Context-size Protection

When clients send num_ctx, the router intercepts it to prevent Ollama from reloading models unnecessarily:

  • num_ctx <= loaded context: stripped (the loaded model already supports it)
  • num_ctx > loaded context: auto-upgrades to a larger loaded model with sufficient context
  • Configurable via FLEET_CONTEXT_PROTECTION (strip/warn/passthrough)
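The decision rules above can be sketched as a small function. This is a hedged illustration of the described behavior, not the router's actual code; the function signature, the warn-mode semantics, and the model name "llama3.3:70b-64k" are all assumptions.

```python
# Sketch of the num_ctx protection decision described above.
# NOT ollama-herd's real logic; structure and semantics are assumed.

def protect_num_ctx(num_ctx, loaded_ctx, larger_models, mode="strip"):
    """Return (action, model_override) for an incoming num_ctx value."""
    if mode == "passthrough" or num_ctx is None:
        return ("pass", None)
    if num_ctx <= loaded_ctx:
        # The loaded model already covers the request: drop num_ctx so
        # Ollama does not reload the model with a new context size.
        return ("strip", None)
    # Request needs more context: upgrade to a loaded model whose
    # window is large enough, if one exists.
    for model, ctx in larger_models:
        if ctx >= num_ctx:
            return ("upgrade", model)
    # No suitable model: warn or strip depending on the configured mode.
    return ("warn", None) if mode == "warn" else ("strip", None)

print(protect_num_ctx(4096, 8192, []))  # ('strip', None)
print(protect_num_ctx(32768, 8192, [("llama3.3:70b-64k", 65536)]))
# ('upgrade', 'llama3.3:70b-64k')
```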

API Endpoints

Fleet Status

# fleet_status — all nodes and queues
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

List all models across the fleet

# model_list — every model on every node
curl -s http://localhost:11435/api/tags | python3 -m json.tool

Models currently loaded in memory (hot)

# hot_models — models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool

OpenAI-compatible model list

curl -s http://localhost:11435/v1/models | python3 -m json.tool

Request Traces (routing decisions)

# traces — recent routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool

Returns: model requested, node selected, score breakdown, latency, tokens, retry/fallback status.
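A quick way to aggregate those traces, once fetched, is a small summary function. The field names (node_id, latency_ms) are inferred from the description above, not a documented schema; verify them against the actual JSON before relying on this.

```python
# Summarize routing traces by node. Field names (node_id, latency_ms)
# are assumed from the description above, not a documented schema.
from collections import Counter

def summarize_traces(traces):
    by_node = Counter(t["node_id"] for t in traces)
    avg_latency = sum(t["latency_ms"] for t in traces) / len(traces)
    return {"requests_per_node": dict(by_node), "avg_latency_ms": avg_latency}

sample = [
    {"model": "llama3.3:70b", "node_id": "mac-studio", "latency_ms": 1800},
    {"model": "qwen2.5:7b",   "node_id": "mac-mini",   "latency_ms": 400},
    {"model": "llama3.3:70b", "node_id": "mac-studio", "latency_ms": 2200},
]
print(summarize_traces(sample))
```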

Model Performance

curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool

Usage Statistics

curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

Fleet Health

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Model Recommendations

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

Settings

curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle auto-pull
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'

Model Management

# model_inventory — per-node model details
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull a model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Delete a model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'

Per-app analytics

curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Dashboard

Web dashboard at http://localhost:11435/dashboard with eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings.

Optimizing Latency

Find the slowest model/node combinations

sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 10 ORDER BY avg_secs DESC LIMIT 10"

Check time-to-first-token

sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, model, AVG(time_to_first_token_ms) as avg_ttft FROM request_traces WHERE time_to_first_token_ms IS NOT NULL GROUP BY node_id, model"

Compare hot vs cold load latency

sqlite3 ~/.fleet-manager/latency.db "SELECT model, CASE WHEN time_to_first_token_ms < 1000 THEN 'hot' ELSE 'cold' END as load_type, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' AND time_to_first_token_ms IS NOT NULL GROUP BY model, load_type ORDER BY model"

Test Inference

# via the OpenAI-compatible endpoint
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'

# via the Ollama API
curl -s http://localhost:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from local LLM"}],"stream":false}'

Resilience

  • Auto-retry — re-scores and retries on the next-best node if a failure occurs before the first chunk
  • Model fallbacks — specify backup models; tries alternatives when the primary is unavailable
  • Context protection — strips unnecessary num_ctx values and auto-upgrades to larger models
  • VRAM-aware fallback — routes to an already-loaded model in the same category
  • Zombie reaper — detects and cleans up stuck in-flight requests
  • Auto-pull — pulls missing models onto the best available node
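The first-chunk retry rule can be sketched as follows: a node is only retried while it has streamed nothing, so the client never receives duplicated partial output. This is an illustrative sketch, not ollama-herd's actual code; route_with_retry, send, and fake_send are hypothetical names.

```python
# Illustrative sketch of first-chunk retry (hypothetical names; NOT
# ollama-herd's real code). A node is retried only if it fails before
# streaming anything, so no partial output reaches the client twice.

def route_with_retry(nodes, request, send):
    """Try nodes best-first; re-route only on failure before the first chunk."""
    for node in nodes:  # assume nodes are pre-sorted by score, best first
        try:
            stream = send(node, request)
            first = next(stream)      # a failure here is safe to retry
        except Exception:
            continue                  # re-route to the next-best node
        yield first
        yield from stream             # after the first chunk, no retries
        return
    raise RuntimeError("no node could serve the request")

def fake_send(node, request):
    """Stand-in for a streaming request; the 'bad' node refuses connections."""
    if node == "bad":
        raise ConnectionError("connection refused")
    yield from ["Hello", " world"]

print("".join(route_with_retry(["bad", "good"], {}, fake_send)))  # Hello world
```

The key design point is that next(stream) pulls the first chunk inside the try block: once any output has been yielded downstream, a retry would duplicate tokens, so failures after that point must surface instead.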

Guardrails

  • Never restart or stop the router or node agents without explicit user confirmation.
  • Never delete or modify files in ~/.fleet-manager/ (contains latency data, traces, and logs).
  • Do not pull or delete models without user confirmation — downloads can be 10-100+ GB.
  • If a node shows as offline, report it rather than attempting to SSH into the machine.

Failure Handling

  • Connection refused → the router may not be running; suggest herd or uv run herd
  • 0 nodes online → suggest starting herd-node on devices
  • mDNS discovery fails → use --router-url http://router-ip:11435
  • Requests hang → check for num_ctx in client requests; verify context protection
  • API errors → check ~/.fleet-manager/logs/herd.jsonl
