Install
openclaw skills install ai-devops-toolkit

Operational tooling for teams running local LLM inference at production quality. This skill provides the observability, tracing, and health monitoring layer for an Ollama Herd fleet: request tracing with full scoring breakdowns, per-application usage analytics via request tags, and fleet-wide health checks. Every workflow, from request tracing to capacity planning, runs through a single SQLite-backed observability stack.
pip install ollama-herd
herd # start the router (exposes the observability endpoints)
herd-node # start the node agent on each monitored machine
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
This toolkit assumes an Ollama Herd router running at http://localhost:11435 with one or more node agents reporting in. It focuses on the operational side: are requests succeeding? What's slow? Which apps consume the most tokens? Are nodes healthy? Is capacity adequate?
Everything in the observability layer is backed by SQLite at ~/.fleet-manager/latency.db. No external databases, no time-series infrastructure. Traces can be queried with standard sqlite3.
~/.fleet-manager/
├── latency.db       # request traces, latency history, usage stats
└── logs/
    └── herd.jsonl   # structured logs, daily rotation, 30-day retention
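To confirm the database is in place and see what it holds, the standard sqlite3 dot-commands work directly (request_traces is the table used in the queries below; any other table names will depend on your version):

# List tables and inspect the trace schema
sqlite3 ~/.fleet-manager/latency.db ".tables"
sqlite3 ~/.fleet-manager/latency.db ".schema request_traces"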
health=$(curl -s http://localhost:11435/dashboard/api/health)
echo "$health" | python3 -m json.tool
Fifteen health checks run against the fleet, each returning a severity (info/warning/critical) and a recommendation. The main ones are listed below, followed by a sketch for filtering the results:
| Check | What it detects |
|---|---|
| Offline nodes | Nodes that stopped sending heartbeats |
| Degraded nodes | Nodes reporting errors or high memory pressure |
| Memory pressure | Nodes approaching memory limits |
| Underutilized nodes | Healthy nodes not receiving traffic |
| VRAM fallbacks | Requests rerouted to loaded alternatives to avoid cold loads |
| Version mismatch | Nodes running different versions than the router |
| Context protection | num_ctx values stripped or models upgraded to prevent reloads |
| Zombie reaper | Stuck in-flight requests cleaned up |
| Model thrashing | Models loading/unloading frequently (memory contention) |
| Request timeouts | Requests exceeding expected latency thresholds |
| Error rates | Elevated failure rates per model or per node |
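A minimal filter for surfacing only actionable items, assuming the health payload exposes its checks as a list of objects with severity and recommendation fields (the exact key names and nesting are assumptions; adjust the accessors to match your response):

# Show only warning/critical checks (field names are assumptions)
curl -s http://localhost:11435/dashboard/api/health | python3 -c "
import sys, json
d = json.load(sys.stdin)
checks = d if isinstance(d, list) else d.get('checks', [])
for c in checks:
    if c.get('severity') in ('warning', 'critical'):
        print(f\"{c.get('severity','?'):8s} {c.get('name','?'):24s} {c.get('recommendation','')}\")
"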
fleet_status=$(curl -s http://localhost:11435/fleet/status)
echo "$fleet_status" | python3 -c "
import sys, json
d = json.load(sys.stdin)
print(f\"DevOps Fleet: {d['fleet']['nodes_online']}/{d['fleet']['nodes_total']} online, {d['fleet']['requests_active']} active requests\")
for n in d['nodes']:
mem = n.get('memory', {})
cpu = n.get('cpu', {})
print(f\" {n['node_id']:20s} {n['status']:10s} CPU={cpu.get('utilization_pct',0):.0f}% MEM={mem.get('used_gb',0):.0f}/{mem.get('total_gb',0):.0f}GB pressure={mem.get('pressure','?')}\")
"
Every routing decision is recorded as a trace with its full scoring context.
traces=$(curl -s "http://localhost:11435/dashboard/api/traces?limit=20")
echo "$traces" | python3 -m json.tool
Each trace includes: request_id, model, original_model (before fallback), node_id, score, scores_breakdown (all 7 signals), status, latency_ms, time_to_first_token_ms, prompt_tokens, completion_tokens, retry_count, fallback_used, tags.
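The same endpoint can also be summarized in one pass, for example to see fallback and retry frequency per model without touching the database; this sketch assumes the response is either a bare list of traces or wraps them under a traces key:

# Fallback and retry frequency per model, from the most recent traces
curl -s "http://localhost:11435/dashboard/api/traces?limit=200" | python3 -c "
import sys, json, collections
d = json.load(sys.stdin)
traces = d if isinstance(d, list) else d.get('traces', [])
stats = collections.defaultdict(lambda: {'n': 0, 'fallbacks': 0, 'retries': 0})
for t in traces:
    s = stats[t.get('model', '?')]
    s['n'] += 1
    s['fallbacks'] += 1 if t.get('fallback_used') else 0
    s['retries'] += t.get('retry_count') or 0
for model, s in sorted(stats.items(), key=lambda kv: -kv[1]['n']):
    print(f\"{model:30s} n={s['n']:4d} fallbacks={s['fallbacks']:3d} retries={s['retries']:3d}\")
"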
# Recent failures with error details
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, node_id, error_message, latency_ms/1000.0 as secs, datetime(timestamp, 'unixepoch', 'localtime') as time FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 20"
# Retry frequency: which nodes need attention?
sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, SUM(retry_count) as retries, COUNT(*) as total, ROUND(100.0 * SUM(CASE WHEN status='failed' THEN 1 ELSE 0 END) / COUNT(*), 1) as fail_pct FROM request_traces GROUP BY node_id ORDER BY fail_pct DESC"
# Fallback frequency: which models are unreliable?
sqlite3 ~/.fleet-manager/latency.db "SELECT original_model, model as fell_back_to, COUNT(*) as n FROM request_traces WHERE fallback_used=1 GROUP BY original_model, model ORDER BY n DESC"
# P50/P75/P99 latency by model
sqlite3 ~/.fleet-manager/latency.db "
WITH ranked AS (
SELECT model, latency_ms,
PERCENT_RANK() OVER (PARTITION BY model ORDER BY latency_ms) as pct
FROM request_traces WHERE status='completed'
)
SELECT model,
ROUND(MIN(CASE WHEN pct >= 0.5 THEN latency_ms END)/1000.0, 1) as p50_s,
ROUND(MIN(CASE WHEN pct >= 0.75 THEN latency_ms END)/1000.0, 1) as p75_s,
ROUND(MIN(CASE WHEN pct >= 0.99 THEN latency_ms END)/1000.0, 1) as p99_s,
COUNT(*) as n
FROM ranked GROUP BY model HAVING n > 10 ORDER BY p75_s DESC
"
# Time-to-first-token by node and model (cold load detection)
sqlite3 ~/.fleet-manager/latency.db "SELECT node_id, model, ROUND(AVG(time_to_first_token_ms), 0) as avg_ttft_ms, ROUND(MAX(time_to_first_token_ms), 0) as max_ttft_ms, COUNT(*) as n FROM request_traces WHERE time_to_first_token_ms IS NOT NULL GROUP BY node_id, model HAVING n > 5 ORDER BY avg_ttft_ms DESC"
# Outlier detection: slowest completed requests
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, node_id, ROUND(latency_ms/1000.0, 1) as secs, prompt_tokens, completion_tokens, retry_count, datetime(timestamp, 'unixepoch', 'localtime') as time FROM request_traces WHERE status='completed' ORDER BY latency_ms DESC LIMIT 10"
Tag requests to track usage per application, team, or environment.
# Tag via request body
curl -s http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello"}],"metadata":{"tags":["devops-prod","devops-code-review"]}}'
# Tag via header
curl -s -H "X-Herd-Tags: devops-prod, devops-code-review" \
-H "Content-Type: application/json" \
http://localhost:11435/v1/chat/completions \
-d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello"}]}'
# Per-application usage summary (aggregated by tag)
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
# Per-application daily breakdown
curl -s http://localhost:11435/dashboard/api/apps/daily | python3 -m json.tool
# Token usage per tag, straight from SQLite
sqlite3 ~/.fleet-manager/latency.db "SELECT j.value as tag, COUNT(*) as requests, SUM(COALESCE(prompt_tokens,0)) as prompt_tok, SUM(COALESCE(completion_tokens,0)) as completion_tok, SUM(COALESCE(prompt_tokens,0)+COALESCE(completion_tokens,0)) as total_tok FROM request_traces, json_each(tags) j WHERE tags IS NOT NULL GROUP BY j.value ORDER BY total_tok DESC"
# Requests per hour of day, UTC (find peak load times)
sqlite3 ~/.fleet-manager/latency.db "SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour_utc, COUNT(*) as requests, ROUND(AVG(latency_ms)/1000.0, 1) as avg_secs FROM request_traces GROUP BY hour_utc ORDER BY hour_utc"
# Daily request volume
sqlite3 ~/.fleet-manager/latency.db "SELECT date(timestamp, 'unixepoch') as day, COUNT(*) as requests, SUM(COALESCE(prompt_tokens,0)+COALESCE(completion_tokens,0)) as tokens FROM request_traces GROUP BY day ORDER BY day DESC LIMIT 14"
recommendations=$(curl -s http://localhost:11435/dashboard/api/recommendations)
echo "$recommendations" | python3 -m json.tool
Returns model recommendations based on hardware capabilities, current usage, and curated benchmark data. Use it for capacity planning: which models fit on which machines, and what the optimal mix is.
# Aggregate usage statistics
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
# View all runtime settings
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
# Toggle a runtime setting
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H "Content-Type: application/json" \
-d '{"auto_pull": false}'
Structured JSONL logs are written to ~/.fleet-manager/logs/herd.jsonl:
# Recent errors
grep '"level": "ERROR"' ~/.fleet-manager/logs/herd.jsonl | tail -10 | python3 -m json.tool --json-lines
# Context protection events
grep "Context protection" ~/.fleet-manager/logs/herd.jsonl | tail -10
# Stream errors
grep "Stream error" ~/.fleet-manager/logs/herd.jsonl | tail -10
A web dashboard is available at http://localhost:11435/dashboard for a visual view of the same data.