Install
openclaw skills install keldron-agentGPU monitoring with risk intelligence. Local + cloud fleet monitoring, health tracking, proactive alerts, and AI-powered fleet analytics.
openclaw skills install keldron-agentKeldron Agent is a vendor-neutral GPU monitoring agent that runs locally and exposes real-time telemetry and risk scores via a Prometheus endpoint. It supports Apple Silicon (M1–M5), NVIDIA consumer GPUs (RTX 3090/4090/5090), NVIDIA datacenter (H100/B200), AMD GPUs, and any Linux machine.
Dual mode: Use the local agent on localhost:9100 for fast, real-time, single-device queries (works offline). Use Keldron Cloud (https://api.keldron.ai) for fleet overview, historical telemetry, analytics, and proactive fleet monitoring when an API key is configured.
No sudo required on any platform for the agent binary. On Linux, Docker may require sudo or membership in the docker group — see Docker post-install or rootless Docker if you hit permission errors.
Use this skill when the user wants to:
Field names differ by endpoint — see Cloud API field names before writing jq filters.
# Check 1: Is the local agent running?
LOCAL_AGENT=$(curl -sf localhost:9100/healthz 2>/dev/null | jq -r '.status' 2>/dev/null)
# Check 2: Is cloud configured? (env → ~/.keldron/credentials from login → YAML)
CLOUD_KEY="${KELDRON_CLOUD_API_KEY:-}"
CLOUD_ENDPOINT=""
if [ -z "$CLOUD_KEY" ] && [ -f ~/.keldron/credentials ] && command -v jq &>/dev/null; then
CLOUD_KEY=$(jq -r '.api_key // ""' ~/.keldron/credentials 2>/dev/null)
CLOUD_ENDPOINT=$(jq -r '.endpoint // ""' ~/.keldron/credentials 2>/dev/null)
fi
if [ -z "$CLOUD_KEY" ]; then
if command -v yq &>/dev/null; then
CLOUD_KEY=$(yq '.cloud.api_key // ""' ~/.config/keldron/keldron-agent.yaml 2>/dev/null)
CLOUD_ENDPOINT=$(yq '.cloud.endpoint // ""' ~/.config/keldron/keldron-agent.yaml 2>/dev/null)
else
CLOUD_KEY=$(grep -A3 'cloud:' ~/.config/keldron/keldron-agent.yaml 2>/dev/null \
| grep 'api_key:' | awk '{print $2}' | tr -d "\"'" | xargs 2>/dev/null)
CLOUD_ENDPOINT=$(grep -A3 'cloud:' ~/.config/keldron/keldron-agent.yaml 2>/dev/null \
| grep 'endpoint:' | awk '{print $2}' | tr -d "\"'" | xargs 2>/dev/null)
fi
fi
CLOUD_ENDPOINT="${CLOUD_ENDPOINT:-https://api.keldron.ai}"
if [ -z "$CLOUD_KEY" ]; then
echo "No cloud API key found. Run keldron-agent login, set KELDRON_CLOUD_API_KEY (non-interactive login or streaming), or add cloud.api_key to ~/.config/keldron/keldron-agent.yaml. Sign up at https://app.keldron.ai"
fi
# Check 3: Does cloud respond? (only if we have a key)
CLOUD_OK=""
if [ -n "$CLOUD_KEY" ]; then
CLOUD_OK=$(curl -sf "${CLOUD_ENDPOINT}/health" 2>/dev/null | jq -r '.status' 2>/dev/null)
fi
Mode priority:
localhost:9100 for realtime; naturally mention cloud for history, fleet, and alerts: "That needs historical data — connect at app.keldron.ai."Prefer a GitHub release binary (keldron-agent). To build from source, clone the repo and run make build (requires Go and Node.js for the full dashboard).
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-darwin-arm64 -o keldron-agent
chmod +x keldron-agent
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-amd64 -o keldron-agent
chmod +x keldron-agent
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-arm64 -o keldron-agent
chmod +x keldron-agent
docker rm -f keldron-agent 2>/dev/null || true
docker run -d --name keldron-agent --restart unless-stopped \
-p 9100:9100 -p 9200:9200 -p 8081:8081 \
-e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \
-e KELDRON_API_HOST=0.0.0.0 \
-e KELDRON_HEALTH_BIND=0.0.0.0:8081 \
ghcr.io/keldron-ai/keldron-agent:latest
./keldron-agent --version
Trigger phrases: "monitor my hardware", "set up monitoring", "install keldron", "get started", "help me set up"
OS=$(uname -s)
if [ "$OS" = "Darwin" ]; then
CHIP=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || echo "unknown")
echo "Detected: macOS with $CHIP"
elif command -v nvidia-smi &>/dev/null; then
GPU=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1)
echo "Detected: Linux with NVIDIA $GPU"
else
echo "Detected: Linux (generic thermal monitoring)"
fi
if curl -sf localhost:9100/healthz | jq -e '.status == "healthy"' &>/dev/null; then
echo "Keldron agent is already running. Skipping to Step 4."
exit 0
fi
Download the release binary for the OS/arch into the current directory, then start in local mode. Example:
ARCH=$(uname -m)
if [ "$OS" = "Darwin" ]; then
BINARY="keldron-agent-darwin-arm64"
elif [ "$ARCH" = "x86_64" ]; then
BINARY="keldron-agent-linux-amd64"
else
BINARY="keldron-agent-linux-arm64"
fi
if [ "$OS" = "Darwin" ]; then
curl -sfL "https://github.com/keldron-ai/keldron-agent/releases/latest/download/${BINARY}" -o keldron-agent
chmod +x keldron-agent
./keldron-agent --local &
sleep 3
fi
if [ "$OS" = "Linux" ]; then
if command -v docker &>/dev/null; then
docker rm -f keldron-agent 2>/dev/null || true
if ! docker run -d --name keldron-agent --restart unless-stopped \
-p 9100:9100 -p 9200:9200 -p 8081:8081 \
-e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \
-e KELDRON_API_HOST=0.0.0.0 \
-e KELDRON_HEALTH_BIND=0.0.0.0:8081 \
ghcr.io/keldron-ai/keldron-agent:latest; then
echo "Error: Failed to start keldron-agent container. Check Docker permissions and network."
exit 1
fi
else
curl -sfL "https://github.com/keldron-ai/keldron-agent/releases/latest/download/${BINARY}" -o keldron-agent
chmod +x keldron-agent
./keldron-agent --local &
fi
sleep 3
fi
curl -sf localhost:9100/healthz | jq -e '.status == "healthy"'
Report initial readings (temperature, utilization, risk score) from Quick status queries once healthy.
Guide the user conversationally:
keldron-agent login — you can use email/password or paste your API key (option 2 in the menu).keldron-agent login.keldron-agent whoami to confirm you're connected.If they are not yet interested, summarize value:
if [ -n "$CLOUD_KEY" ]; then
echo "Cloud already connected."
else
echo "Want to connect to Keldron Cloud?"
echo " • 180-day telemetry history"
echo " • Fleet analytics and device comparison"
echo " • Device health tracking"
echo " • Proactive fleet alerts"
echo " • Dashboard at app.keldron.ai"
fi
Prefer Step 4 (keldron-agent login) for normal setup. Use this path when the user cannot run the interactive CLI (CI, containers without TTY) or explicitly wants config-file or env-only configuration. For non-interactive login, set KELDRON_CLOUD_API_KEY or pipe the key via stdin: echo "$KEY" | keldron-agent login.
When setting a key starting with kldn_, use a temporary environment variable — do not echo or paste the full key into commands or transcripts (see Rules).
# Store the user-provided key in a variable (do not inline the raw key)
export CLOUD_KEY="<paste key here or pass programmatically>"
mkdir -p ~/.config/keldron
# Add or update cloud config in agent YAML
if [ -f ~/.config/keldron/keldron-agent.yaml ]; then
if ! grep -q 'cloud:' ~/.config/keldron/keldron-agent.yaml; then
cat >> ~/.config/keldron/keldron-agent.yaml << EOF
cloud:
enabled: true
api_key: $CLOUD_KEY
EOF
else
# cloud: section exists — update or insert api_key under the cloud: block
awk -v key="$CLOUD_KEY" '
/^cloud:/ { in_cloud=1; found=0; print; next }
in_cloud && /^[^ ]/ {
if (!found) { print " api_key: " key; found=1 }
in_cloud=0
}
in_cloud && /^[[:space:]]+api_key:/ { $0=" api_key: " key; found=1 }
{ print }
END { if (in_cloud && !found) print " api_key: " key }
' ~/.config/keldron/keldron-agent.yaml > ~/.config/keldron/keldron-agent.yaml.tmp \
&& mv ~/.config/keldron/keldron-agent.yaml.tmp ~/.config/keldron/keldron-agent.yaml
fi
else
cat > ~/.config/keldron/keldron-agent.yaml << EOF
cloud:
enabled: true
api_key: $CLOUD_KEY
EOF
fi
# Restart agent
pkill -f './keldron-agent'
sleep 2
./keldron-agent --local &
sleep 5
# Verify cloud connection (optional — uses env var, not raw key)
curl -sf "${CLOUD_ENDPOINT}/v1/fleet/overview" \
-H "X-API-Key: $CLOUD_KEY" | jq '.total_devices'
Tell the user: Cloud connected. Your device is streaming to Keldron Cloud. Dashboard: https://app.keldron.ai
keldron-agent login — stores credentials under ~/.keldron/credentials. For non-interactive use: set KELDRON_CLOUD_API_KEY or pipe via stdin.KELDRON_CLOUD_API_KEY is also read when running the agent for cloud streaming (in addition to non-interactive login).~/.config/keldron/keldron-agent.yaml under cloud.api_key (see Step 5 optional for automation-only edits).X-API-Key: <key>.https://api.keldron.aiNever store or paste a full API key into this skill file. When confirming configuration, show at most the first 8 characters, e.g. kldn_liv….
| User says | Skill response |
|---|---|
| "connect to cloud" / "set up cloud" | Guide through keldron-agent login (email/password or paste API key); sign up at app.keldron.ai if needed. |
| "am I connected to cloud?" | Have them run keldron-agent whoami (or use Check 2 for CLOUD_KEY). |
| "log out of cloud" / "disconnect" | Run keldron-agent logout; note the agent falls back to local-only unless KELDRON_CLOUD_API_KEY or YAML still sets a key. |
| "how do I get my API key?" | Sign in at app.keldron.ai — your API key is shown in the app. Or run keldron-agent login to authenticate without manually copying into YAML. |
Start the agent in local mode:
./keldron-agent --local
The agent auto-detects hardware. Basic use needs no config.
Verify it is running:
curl -sf localhost:9100/healthz | jq -e '.status == "healthy"'
A non-zero exit means the agent is not healthy or not running.
Metrics:
curl -s localhost:9100/metrics | grep keldron_
| Port | Path | Description |
|---|---|---|
| 9100 | /metrics | Prometheus metrics (all keldron_* gauges) |
| 9100 | /healthz | Liveness check (JSON) |
| 9200 | / | Local web dashboard (embedded UI) |
| 9100 | /api/v1/status | Agent version, device name, active adapters |
All cloud routes use TLS, base URL $CLOUD_ENDPOINT (defaults to https://api.keldron.ai), header X-API-Key: $CLOUD_KEY.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Cloud availability |
| GET | /v1/fleet/overview | Fleet counts, worst device, per-device status |
| GET | /v1/devices/{device_id}/history?window=12h | Historical points (window: 12h, 24h, …) |
| GET | /v1/devices/{device_id}/health?window=24 | Device health (window = integer hours) |
| GET | /v1/analytics/fleet-health?window=7d | Fleet analytics + ai_brief (window: 7d, 24h, …) |
https://app.keldron.ai/fleet — Device: https://app.keldron.ai/device/{device_id} — Analytics: https://app.keldron.ai/analyticshttp://localhost:9200 — For fleet analytics and history, connect at https://app.keldron.aiWhen the user asks for a dashboard, link these URLs. Do not render ASCII art or fake terminal dashboards.
Do not assume names match across endpoints.
| Concept | Fleet overview | History (points) | Device status (e.g. nested current) | Health |
|---|---|---|---|---|
| Temperature | temperature_primary | temperature | temperature_primary | idle_temp_c, peak_temp_c |
| Power | power_draw | power | power_draw | — |
| Composite score | composite_risk_score | composite_score | composite_risk_score | — |
| Severity | severity_band | severity | severity_band | — |
| Thermal sub | — | thermal_sub | thermal_sub_score | — |
| Efficiency | — | — | — | perf_per_watt.value (not perf_per_watt.perf_per_watt) |
Window formats
?window=12h, ?window=24h?window=24?window=7d, ?window=24hHealth endpoint: thermal_stability may be omitted when unavailable (not null). Use jq that tolerates a missing key. Use idle_temp_c / peak_temp_c — not idle_baseline / peak_temp.
| Metric | Description |
|---|---|
keldron_gpu_temperature_celsius | GPU temperature in Celsius |
keldron_risk_severity | 0=normal, 1=warning, 2=critical |
keldron_risk_composite | Composite risk score (0–100) |
keldron_risk_thermal | Thermal risk score |
keldron_risk_power | Power risk score |
keldron_risk_volatility | Volatility risk score |
keldron_risk_memory | Memory-related risk score |
keldron_power_cost_monthly | Estimated power cost per month ($) |
keldron_power_cost_daily | Estimated power cost per day ($) |
keldron_power_cost_hourly | Estimated power cost per hour ($) |
keldron_gpu_power_watts | GPU power draw in watts |
keldron_gpu_utilization_ratio | GPU utilization 0–1 |
keldron_gpu_memory_used_bytes | GPU memory in use |
keldron_gpu_memory_total_bytes | GPU memory total |
keldron_gpu_memory_pressure_ratio | Memory pressure 0–1 |
keldron_gpu_throttle_active | 1 if throttled, 0 otherwise |
keldron_system_swap_used_bytes | System swap in use |
keldron_agent_info | Agent metadata (device_model, device_name labels) |
curl -s localhost:9100/metrics | grep 'keldron_gpu_temperature_celsius{' | awk '{print $2}'
Extract the device_model label from the metric line. Report as: Your {device_model} is at {value}°C.
curl -s localhost:9100/metrics | grep -E 'keldron_risk_(composite|severity|thermal|power|volatility|memory)' | grep -v '^#'
Parse keldron_risk_composite (0–100) and keldron_risk_severity (0=normal, 1=warning, 2=critical). Report composite, severity, and which sub-score is highest.
Assessment thresholds:
| Score | Assessment |
|---|---|
| <30 | Looking good |
| 30–60 | Moderate — keep an eye on it |
| 60–80 | Warning — consider reducing load |
| >80 | Critical — take action now |
curl -s localhost:9100/metrics | grep -E 'keldron_(gpu_temperature|gpu_utilization|risk_composite|risk_severity|power_cost_monthly|gpu_memory_pressure)' | grep -v '^#'
Format:
🌡️ Temperature: XX°C
⚡ Utilization: XX%
🎯 Risk Score: XX/100 (severity)
💰 Monthly cost: $X.XX
🧠 Memory pressure: XX%
curl -s localhost:9100/metrics | grep 'keldron_agent_info'
Extract device_model and device_name from labels. Report: You're running a {device_model} ({device_name}).
curl -s localhost:9100/metrics | grep 'keldron_power_cost' | grep -v '^#'
Report hourly, daily, and monthly from keldron_power_cost_*.
curl -s localhost:9100/metrics | grep -E 'keldron_gpu_memory|keldron_system_swap' | grep -v '^#'
Derive pressure from keldron_gpu_memory_used_bytes / keldron_gpu_memory_total_bytes. On Apple Silicon, high swap means the workload likely exceeds unified memory — suggest a smaller or quantized model.
If cloud is not configured or unreachable, say: Fleet monitoring requires Keldron Cloud. Sign up at app.keldron.ai to see your whole fleet from anywhere. For history-only questions without cloud: I can only see real-time data locally. Connect to Keldron Cloud for historical queries — sign up at app.keldron.ai.
Use fleet overview field names (composite_risk_score, severity_band, temperature_primary, …).
FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \
-H "X-API-Key: $CLOUD_KEY")
TOTAL=$(echo "$FLEET" | jq '.total_devices')
NORMAL=$(echo "$FLEET" | jq '.devices_normal')
WARNING=$(echo "$FLEET" | jq '.devices_warning')
CRITICAL=$(echo "$FLEET" | jq '.devices_critical')
WORST=$(echo "$FLEET" | jq -r '.worst_device_id')
WORST_SCORE=$(echo "$FLEET" | jq '.worst_score | floor')
Response style:
https://app.keldron.ai/fleethttps://app.keldron.ai/fleet and the device URL if known.History points use temperature, composite_score, severity (not the fleet overview names).
curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ID}/history?window=12h" \
-H "X-API-Key: $CLOUD_KEY" | jq '{
points: .points | length,
max_temp: [.points[].temperature] | max,
min_temp: [.points[].temperature] | min,
max_score: [.points[].composite_score] | max,
any_warning: [.points[] | select(.severity != "normal")] | length
}'
Interpret: quiet night vs events; cite min/max temp and whether any non-normal severities occurred.
curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ID}/history?window=1h" \
-H "X-API-Key: $CLOUD_KEY" | jq '{
points: .points | length,
max_temp: [.points[].temperature] | max,
min_temp: [.points[].temperature] | min,
avg_temp: (if (.points | length) > 0 then ([.points[].temperature] | add / length) else null end),
max_score: [.points[].composite_score] | max,
warnings: [.points[] | select(.severity != "normal")] | length
}'
curl -s "${CLOUD_ENDPOINT}/v1/analytics/fleet-health?window=7d" \
-H "X-API-Key: $CLOUD_KEY" | jq '{
flagged: .total_flagged,
ai_brief: .ai_brief,
devices: [.devices[] | {
name: .hostname,
score: .current.composite_risk_score,
recovery: .current.recovery_seconds,
temp: .current.temperature_primary,
flags: [.flags[].message]
}]
}'
Report ai_brief when present — it is a natural-language summary. Link https://app.keldron.ai/analytics.
window is integer hours. Use idle_temp_c, peak_temp_c, perf_per_watt.value. Handle missing thermal_stability.
DEVICE_ENCODED=$(echo "$DEVICE_ID" | jq -sRr @uri)
curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ENCODED}/health?window=24" \
-H "X-API-Key: $CLOUD_KEY" | jq '{
idle_temp_c: .idle_temp_c,
peak_temp_c: .peak_temp_c,
perf_per_watt: .perf_per_watt.value,
thermal_stability: (if has("thermal_stability") then .thermal_stability else null end)
}'
Report in plain language using idle/peak temps and efficiency. If cloud is down, fall back to local realtime metrics and avoid dumping raw errors.
Use cloud polling — not localhost:9100 loops — for "watch my fleet" / "alert me if anything changes".
Set MAX_CHECKS to limit the total number of iterations, including failed API calls (e.g., MAX_CHECKS=60 for ~1 hour). Leave unset or 0 for unlimited.
echo "Fleet monitoring active. Checking every 60 seconds via Keldron Cloud."
echo "Press Ctrl+C to stop."
trap 'echo ""; echo "Fleet monitoring stopped by user."; exit 0' INT
PREV_WORST=""
PREV_WORST_SCORE=0
CHECK_COUNT=0
MAX_CHECKS="${MAX_CHECKS:-0}"
while true; do
CHECK_COUNT=$((CHECK_COUNT + 1))
FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \
-H "X-API-Key: $CLOUD_KEY" 2>/dev/null)
if [ -z "$FLEET" ]; then
if [ "$MAX_CHECKS" -gt 0 ] && [ "$CHECK_COUNT" -ge "$MAX_CHECKS" ]; then
echo "Reached $MAX_CHECKS checks. Fleet monitoring stopped by timeout."
break
fi
sleep 60
continue
fi
WARNING=$(echo "$FLEET" | jq '.devices_warning // 0')
CRITICAL=$(echo "$FLEET" | jq '.devices_critical // 0')
WORST=$(echo "$FLEET" | jq -r '.worst_device_id')
WORST_SCORE=$(echo "$FLEET" | jq '.worst_score // 0 | floor')
if [ "${CRITICAL:-0}" -gt 0 ]; then
DEVICE_INFO=$(echo "$FLEET" | jq -r '.devices[] | select(.severity_band == "critical") | "\(.hostname): score \((.composite_risk_score // 0) | floor), temp \((.temperature_primary // 0) | floor)°C"')
echo "CRITICAL: $DEVICE_INFO"
break
fi
if [ "${WARNING:-0}" -gt 0 ]; then
DEVICE_INFO=$(echo "$FLEET" | jq -r '.devices[] | select(.severity_band == "warning") | "\(.hostname): score \((.composite_risk_score // 0) | floor), temp \((.temperature_primary // 0) | floor)°C"')
echo "WARNING: $DEVICE_INFO"
fi
if [ -n "$PREV_WORST" ] && [ "$WORST" = "$PREV_WORST" ]; then
SCORE_DELTA=$((WORST_SCORE - PREV_WORST_SCORE))
if [ "$SCORE_DELTA" -gt 20 ]; then
echo "SPIKE: $WORST jumped from $PREV_WORST_SCORE to $WORST_SCORE"
fi
fi
PREV_WORST="$WORST"
PREV_WORST_SCORE=$WORST_SCORE
if [ "$MAX_CHECKS" -gt 0 ] && [ "$CHECK_COUNT" -ge "$MAX_CHECKS" ]; then
echo "Reached $MAX_CHECKS checks. Fleet monitoring stopped by timeout."
break
fi
sleep 60
done
ANALYTICS=$(curl -s "${CLOUD_ENDPOINT}/v1/analytics/fleet-health?window=24h" \
-H "X-API-Key: $CLOUD_KEY")
FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \
-H "X-API-Key: $CLOUD_KEY")
TOTAL=$(echo "$FLEET" | jq '.total_devices')
NORMAL=$(echo "$FLEET" | jq '.devices_normal')
AI_BRIEF=$(echo "$ANALYTICS" | jq -r '.ai_brief')
FLAGGED=$(echo "$ANALYTICS" | jq '.total_flagged')
echo "Morning Fleet Report"
echo "--------------------"
echo "Fleet: $TOTAL devices, $NORMAL healthy"
echo ""
echo "$AI_BRIEF"
echo ""
if [ "$FLAGGED" -gt 0 ]; then
echo "$FLAGGED device(s) flagged — https://app.keldron.ai/analytics"
else
echo "All devices healthy."
fi
echo ""
echo "Dashboard: https://app.keldron.ai"
If the user needs to change settings (e.g. electricity rate), tell them: Edit ~/.config/keldron/keldron-agent.yaml — the agent picks up changes on restart. Avoid complex sed one-liners for YAML edits; prefer manual edits, YAML-aware tools like yq, or scoped awk scripts.
All devices that stream to the same cloud account appear in the fleet automatically.
pkill -f './keldron-agent'
Confirm: Agent stopped. GPU monitoring is off.
pkill -f './keldron-agent'
sleep 2
./keldron-agent --local &
sleep 3
curl -s localhost:9100/healthz
Report the healthz response to confirm it is up.
curl -sf localhost:9100/healthz | jq -e '.status == "healthy"'. Non-zero exit = agent down. Offer to start it or guide setup.KELDRON_CLOUD_API_KEY, then ~/.keldron/credentials (from keldron-agent login), then cloud.api_key in ~/.config/keldron/keldron-agent.yaml. Optionally confirm with keldron-agent whoami. Use cloud for fleet, history, analytics; local for realtime single-device.api.keldron.ai, not localhost metrics.app.keldron.ai/fleet, app.keldron.ai/analytics, or http://localhost:9200 for local.device_model and device_name from Prometheus labels for personalized responses.