Install
openclaw skills install gpu-cluster-manager
Turn your spare GPUs into one inference endpoint. Auto-discovers machines on your network, routes requests to the best available device, learns when your mac...
You are managing a GPU cluster that combines multiple machines into one inference endpoint for running local LLMs via Ollama. The GPU cluster routes every request to the best available device automatically.
Your desktop, laptop, and maybe an old Linux box all have GPUs sitting idle most of the time. You want one URL that uses all of them, without Kubernetes, Docker, or hand-edited config files. Point your AI apps at the GPU cluster endpoint and let the cluster decide which machine handles each request.
This GPU cluster manager does exactly that. Install it, run two commands, and your machines discover each other automatically. The cluster learns when your devices are free, pauses during video calls, and picks the best node for every request based on real-time conditions.
pip install ollama-herd # GPU cluster manager from PyPI
On your main GPU cluster machine (the router):
herd # starts GPU cluster router
On each other GPU cluster machine:
herd-node # joins the GPU cluster automatically
That's it. The GPU cluster nodes find the router via mDNS. No config files. Your GPU cluster is running.
If mDNS doesn't work on your GPU cluster network:
herd-node --router-url http://router-ip:11435
Your GPU cluster runs at http://localhost:11435. Point any AI app at the GPU cluster:
from openai import OpenAI

# GPU cluster client
gpu_cluster_client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
gpu_cluster_response = gpu_cluster_client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain GPU cluster routing for AI inference"}],
)
# Print the text of the first completion choice
print(gpu_cluster_response.choices[0].message.content)
Works with: LangChain, CrewAI, AutoGen, LlamaIndex, Aider, Cline, Continue.dev, and any OpenAI-compatible client pointing at the GPU cluster.
# Fleet status
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
# Available models (Ollama-style /api/tags)
curl -s http://localhost:11435/api/tags | python3 -m json.tool
# Currently loaded models (Ollama-style /api/ps)
curl -s http://localhost:11435/api/ps | python3 -m json.tool
# Health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
Returns GPU cluster recommendations based on your hardware — which models fit, which are too big, and the optimal GPU cluster mix.
# Request traces (limited to 10)
curl -s "http://localhost:11435/dashboard/api/traces?limit=10" | python3 -m json.tool
# Usage
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
# Current settings
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
# Change a setting (here: set auto_pull to false)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'
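The read-only GET endpoints above can also be queried from a script. The following is a minimal sketch using only the Python standard library; the helper names `endpoint_url` and `fetch_json` are hypothetical (they are not part of the tool), and the shape of the returned JSON is not specified here, so the code simply decodes whatever the router returns.

```python
import json
import urllib.request
from urllib.parse import urlencode, urljoin

ROUTER_URL = "http://localhost:11435"  # default router address used in this guide


def endpoint_url(path, base=ROUTER_URL, **params):
    """Build the full URL for a router endpoint, with optional query parameters."""
    url = urljoin(base.rstrip("/") + "/", path.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


def fetch_json(path, base=ROUTER_URL, timeout=5.0, **params):
    """GET an endpoint and decode the JSON body (hypothetical helper)."""
    url = endpoint_url(path, base, **params)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a router running, `fetch_json("/fleet/status")` should return the same data as the corresponding curl command, and `fetch_json("/dashboard/api/traces", limit=10)` mirrors the traces query.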
# What's on each GPU cluster node
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
# Download a model to a specific GPU cluster node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "gpu-cluster-studio"}'
# Remove a model from a GPU cluster node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "gpu-cluster-studio"}'
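The pull and delete calls above take the same JSON body, so they can be wrapped in one small helper. This is a sketch under assumptions: `model_request` and `send` are hypothetical names, the response format of these endpoints is not documented here (so the body is returned as raw text), and the 30-second timeout is an arbitrary choice that may be too short for a large download.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default router address used in this guide


def model_request(action, model, node_id, base=ROUTER_URL):
    """Build (but do not send) a POST for /dashboard/api/pull or /dashboard/api/delete."""
    if action not in ("pull", "delete"):
        raise ValueError(f"unsupported action: {action!r}")
    body = json.dumps({"model": model, "node_id": node_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/dashboard/api/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a prepared request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect or log what will be deleted before anything reaches the router, for example `send(model_request("delete", "old-model:7b", "gpu-cluster-studio"))`.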
# Per-app usage
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
Tag your GPU cluster requests to see which apps use the most time:
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Summarize GPU cluster utilization"}],"metadata":{"tags":["gpu-cluster-app"]}}'
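The tagged request above can be sketched in Python as well. The helper names below are hypothetical; the payload mirrors the curl body, and `extract_reply` assumes the router returns the standard OpenAI chat-completion shape (`choices[0].message.content`), which the `/v1` path suggests but this guide does not spell out.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default router address used in this guide


def tagged_chat_payload(model, prompt, tags):
    """Build a chat request body carrying app tags under metadata.tags."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "metadata": {"tags": list(tags)},
    }


def chat_request(payload, base=ROUTER_URL):
    """Wrap a payload in a POST to the OpenAI-compatible chat endpoint."""
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response):
    """Return the assistant text from an OpenAI-style response dict (assumed shape)."""
    return response["choices"][0]["message"]["content"]
```

Keeping the tag list in one helper means each app can set its own tag once and reuse it for every request. With the `openai` client, the same `metadata` object can likely be passed through the client's `extra_body` argument.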
Open http://localhost:11435/dashboard for a visual GPU cluster overview. Eight tabs: Fleet Overview (live GPU cluster node cards), Trends (charts), Model Insights (performance comparison), Apps (per-app usage), Benchmarks, Health (automated GPU cluster checks), Recommendations (what models to run), Settings.
# Quick GPU cluster test
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello from the GPU cluster!"}],"stream":false}'
# Slowest model/node pairs by average latency (pairs with more than 5 completed requests)
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, AVG(latency_ms)/1000.0 as avg_secs, COUNT(*) as n FROM request_traces WHERE status='completed' GROUP BY node_id, model HAVING n > 5 ORDER BY avg_secs DESC LIMIT 10"
# Ten most recent failed requests
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message, latency_ms/1000.0 as secs FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"
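The latency query above can be run from Python with the standard `sqlite3` module. This sketch assumes only the table and columns that appear in the queries (`request_traces` with `model`, `node_id`, `latency_ms`, `status`); the function names are hypothetical, and the database is opened read-only so the script cannot modify the trace data.

```python
import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".fleet-manager" / "latency.db"

SLOWEST_SQL = """
SELECT model, node_id, AVG(latency_ms) / 1000.0 AS avg_secs, COUNT(*) AS n
FROM request_traces
WHERE status = 'completed'
GROUP BY node_id, model
HAVING n > ?
ORDER BY avg_secs DESC
LIMIT ?
"""


def open_readonly(path=DB_PATH):
    """Open the trace database in read-only mode via an SQLite URI."""
    return sqlite3.connect(f"{Path(path).as_uri()}?mode=ro", uri=True)


def slowest_pairs(conn, min_requests=5, limit=10):
    """Return (model, node_id, avg_secs, n) rows, slowest first."""
    return conn.execute(SLOWEST_SQL, (min_requests, limit)).fetchall()
```

A typical call is `slowest_pairs(open_readonly())`; raising `min_requests` filters out pairs with too few samples to give a meaningful average.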
~/.fleet-manager/ (contains all your GPU cluster data and logs).
herd or uv run herd
herd-node on GPU cluster devices
--router-url http://router-ip:11435
num_ctx in client requests; context protection handles it
~/.fleet-manager/logs/herd.jsonl