OpenClaw Reliability

Read-only reliability smoke checks and health summaries for OpenClaw agents

Install

openclaw plugins install clawhub:openclaw-reliability

Goals

  • Detect reliability risks before they become agent failures.
  • Classify tool/provider/runtime/plugin problems.
  • Provide local, redacted, reusable health summaries.
  • Stay safe by default: no restarts, no config changes, no plugin disabling.

Non-goals

  • This is not a security boundary. Use openclaw-language-boundary for action policy.
  • This does not automatically fix host/firewall/SSH/Gateway config.
  • This does not upload logs.

Current usage

npm run typecheck
npm run smoke
npm run report
npm run repair-plan
npm run release:check
npm run smoke -- --cpu-sample-seconds=8
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000
npm run smoke:json
npm run repair-plan:json

npm run release:check runs the read-only internal gate: typecheck, tests, build, required docs, example JSON fixture validation, and live passive smoke:json schema validation. For an offline, fixture-only release check, append -- --skip-smoke (npm run release:check -- --skip-smoke).

npm run report wraps passive smoke JSON into a shorter human-readable report with:

  • overall health result
  • blockers vs acceptable warnings
  • recommended next action
  • whether external probes were run
  • whether any state-changing remediation should be reviewed first

Example outputs and interpretation guides are available in examples/, docs/reliability-examples.md, and docs/state-changing-repair-design.md. They include these sample files and cover these topics:

  • passive-smoke-output.txt
  • smoke-json-sample.json
  • report-output.txt
  • clean / OK interpretation
  • runtime sampling noise
  • provider timeout
  • provider HTTP / redirect errors
  • caller input / path-policy errors
  • operator approval outcomes
  • source-only plugin / runtime output warnings
  • future state-changing repair safety contract

Current checks include:

  • Gateway /health plus sustained process CPU sampling.
  • Diagnostic log signals since the latest Gateway start window.
  • Config warnings from openclaw config validate.
  • Web provider credential mismatch, including tools.web.search / legacy web.search, Tavily, and MiniMax model-vs-search credential paths.
  • Source-only local extension shadows that cause compiled-runtime warning noise.
  • language-boundary runtime state and audit summary, with stale historical tool failures ignored after a freshness window.
  • Skills root pressure.
  • Channel session-expired log noise.
  • Optional active probes for selected tools/providers. Disabled by default; pass --probe-tools=web_fetch,web_search to run lightweight external checks.

Result categories

The smoke output separates true runtime degradation from short-window sampling noise:

  • runtime_degraded: sustained or unclear Gateway pressure. Treat this as actionable and investigate active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file-descriptor growth.
  • runtime_health_sampling_noise: /health reports event_loop_utilization / cpu, but sustained CPU sampling is low or the latest diagnostic logs are clearly model_call / session-queue related. Do not restart or disable components for this alone; re-test when no model call is active.
  • provider_missing_config: a provider is selected/enabled but its tool-specific credential path is missing.
  • plugin_runtime_output_missing: a local source-only extension shadows a compiled installed package and may create loader/config warning noise.
  • tool_degraded: current non-healthy runtime-state record for a tool/provider within the freshness window.
  • tool_failure_stale: historical non-healthy runtime-state record older than the freshness window or followed by a newer success. Stale records should not drive operational decisions.
  • provider_timeout: provider/tool timeout. Prefer opt-in active probes and provider reachability checks before changing runtime state.
  • provider_http_error: HTTP/redirect endpoint behavior such as 404 or too many redirects. Usually not Gateway degradation.
  • tool_input_error: caller/input/path-policy failure such as invalid parameters, edit range overlap, or disallowed local media path. Fix caller input before changing runtime state.
  • operator_approval: approval timeout/denial. Treat as governance outcome, not tool/runtime degradation.
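
As a rough illustration of how a consumer might act on these categories, the sketch below maps each category name to a triage hint. The category names come from the list above; the Triage buckets, the TRIAGE table, and both function names are hypothetical and only summarize the guidance in this section, not the plugin's internal logic.

```typescript
// Category names are copied from the list above.
type ResultCategory =
  | "runtime_degraded"
  | "runtime_health_sampling_noise"
  | "provider_missing_config"
  | "plugin_runtime_output_missing"
  | "tool_degraded"
  | "tool_failure_stale"
  | "provider_timeout"
  | "provider_http_error"
  | "tool_input_error"
  | "operator_approval";

// Hypothetical triage buckets (an assumption, not part of the plugin).
type Triage =
  | "investigate_runtime"
  | "retest_later"
  | "fix_config"
  | "review_plugin"
  | "inspect_tool"
  | "ignore_stale"
  | "check_provider"
  | "fix_caller_input"
  | "governance";

// One hint per category, paraphrasing the descriptions above.
const TRIAGE: Record<ResultCategory, Triage> = {
  runtime_degraded: "investigate_runtime",
  runtime_health_sampling_noise: "retest_later",
  provider_missing_config: "fix_config",
  plugin_runtime_output_missing: "review_plugin",
  tool_degraded: "inspect_tool",
  tool_failure_stale: "ignore_stale",
  provider_timeout: "check_provider",
  provider_http_error: "check_provider",
  tool_input_error: "fix_caller_input",
  operator_approval: "governance",
};

function triage(category: ResultCategory): Triage {
  return TRIAGE[category];
}

// Only runtime_degraded is described as real Gateway pressure.
function indicatesGatewayPressure(categories: ResultCategory[]): boolean {
  return categories.includes("runtime_degraded");
}
```

Keeping the table exhaustive over the union type means the compiler flags any category that is added later without a triage hint.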

Repair plan (advisory only)

npm run repair-plan converts smoke findings into a ranked advisory plan. It does not execute commands, edit config, restart Gateway, disable plugins, or change cron jobs.

npm run repair-plan
npm run repair-plan:json

The JSON schema is openclaw.reliability.repair-plan.v1. Each action includes severity, confidence, risk, confirmation requirements, evidence, manual steps, and rollback notes (for cases where a future state-changing remediation might exist). Each action also carries a ticket object for operator handoff:

  • trigger — why this action appeared
  • currentImpact — what is affected now, and what is not implied
  • recommendedOwner — operator, developer, provider, or security
  • escalation — when to move beyond the default recommendation
  • doNotDo — explicit anti-actions, such as not restarting Gateway for provider HTTP errors
  • postCheck — read-only checks to run after investigation or manual repair
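
To make the ticket shape easier to picture, here is a minimal sketch of a type for it plus a plain-text formatter for handoff. The six field names and the four owner values come from the list above; the value types (string versus string array), the formatTicket helper, and the sample wording are assumptions and may not match the real openclaw.reliability.repair-plan.v1 schema.

```typescript
type Owner = "operator" | "developer" | "provider" | "security";

// Field names are from the README; the value types are assumed.
interface Ticket {
  trigger: string;
  currentImpact: string;
  recommendedOwner: Owner;
  escalation: string;
  doNotDo: string[];
  postCheck: string[];
}

// Hypothetical helper: renders a ticket as plain text for an operator.
function formatTicket(actionId: string, t: Ticket): string {
  const lines = [
    `[${actionId}] owner=${t.recommendedOwner}`,
    `why: ${t.trigger}`,
    `impact: ${t.currentImpact}`,
    `escalate: ${t.escalation}`,
    ...t.doNotDo.map((d) => `do not: ${d}`),
    ...t.postCheck.map((p) => `post-check: ${p}`),
  ];
  return lines.join("\n");
}

// Illustrative values only; not real plugin output.
const sampleTicket: Ticket = {
  trigger: "provider_http_error reported by smoke",
  currentImpact: "one provider endpoint is failing; Gateway health is not implied to be degraded",
  recommendedOwner: "provider",
  escalation: "escalate if the error persists across repeated probes",
  doNotDo: ["restart Gateway for provider HTTP errors"],
  postCheck: ["npm run smoke"],
};

const handoffText = formatTicket("investigate-provider-http-error", sampleTicket);
```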

Use this as the v0.2 bridge between diagnosis and future confirmed repair flows.

Confirmed repair scaffold

npm run repair is the v0.3 safety scaffold. It can preview a selected repair-plan action and can confirm only explicitly implemented read-only diagnostic actions. It does not restart Gateway, edit config, write credentials, disable plugins, delete files, change cron jobs, or mutate channel state.

npm run repair-plan
npm run repair -- --action <actionId> --dry-run
npm run repair -- --action <actionId> --confirm

Current behavior:

  • missing --action is blocked
  • unknown action ids are blocked
  • --dry-run prints the selected action, suggested commands, manual steps, and rollback notes
  • --confirm is implemented only for safe diagnostic actions:
    • retest-runtime-sampling-noise: reruns smoke
    • investigate-provider-timeout: runs an explicit web_fetch active probe
    • investigate-provider-http-error: runs an explicit web_fetch active probe for HTTP/redirect behavior
    • inspect-current-tool-degraded: reruns smoke to inspect current tool/provider state
    • review-tool-input-error: explains caller/input/path-policy issues without executing commands
    • fix-provider-missing-config: reruns openclaw config validate without writing credentials
  • state-changing actions remain blocked even with --confirm, including skills cleanup, channel re-auth/disablement, plugin cleanup/rebuild, config cleanup, and runtime restart investigations
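
The blocking rules above can be sketched as a small gate function. The six confirmable action ids are copied from the list; the function name, the Decision shape, the reason strings, and the assumption that --dry-run may preview any action in the plan are illustrative guesses, not the actual implementation.

```typescript
// Action ids that the README lists as confirmable read-only diagnostics.
const CONFIRMABLE_ACTIONS: readonly string[] = [
  "retest-runtime-sampling-noise",
  "investigate-provider-timeout",
  "investigate-provider-http-error",
  "inspect-current-tool-degraded",
  "review-tool-input-error",
  "fix-provider-missing-config",
];

type Mode = "dry-run" | "confirm";

// Hypothetical result shape.
interface Decision {
  allowed: boolean;
  reason: string;
}

function gateRepair(
  actionId: string | undefined,
  mode: Mode,
  planActionIds: readonly string[],
): Decision {
  if (!actionId) {
    return { allowed: false, reason: "missing --action" };
  }
  if (!planActionIds.includes(actionId)) {
    return { allowed: false, reason: `unknown action id: ${actionId}` };
  }
  if (mode === "dry-run") {
    // Assumption: previews never execute anything, so any plan action may be shown.
    return { allowed: true, reason: "preview only" };
  }
  if (CONFIRMABLE_ACTIONS.includes(actionId)) {
    return { allowed: true, reason: "read-only diagnostic" };
  }
  // Everything else is treated as state-changing and stays blocked.
  return { allowed: false, reason: "state-changing actions stay blocked" };
}
```

Using an allowlist rather than a blocklist means a newly added action is blocked under --confirm until someone explicitly marks it as a safe diagnostic.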

State-changing repair is design-only today. See docs/state-changing-repair-design.md for the required dry-run, confirm, backup, rollback, audit, redaction, and post-check contract before any mutating action is implemented.

Active probes

Smoke stays passive by default. Active probes are opt-in because they can create outbound network traffic and consume provider quota.

npm run smoke -- --probe-tools=web_fetch
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000

Current probes:

  • web_fetch: HTTP GET https://example.com using the local runtime network path.
  • web_search: Tavily search probe when tools.web.search.provider=tavily and plugins.entries.tavily.config.webSearch.apiKey or TAVILY_API_KEY is configured.

Unsupported probe names produce a warning rather than failing the whole smoke run.
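
A minimal sketch of the opt-in behavior described above: parse the --probe-tools value, keep supported names, and turn unsupported names into warnings instead of errors. The probe names, config paths, and TAVILY_API_KEY come from this section; the function names, the ProbeConfig type, and the warning text are hypothetical.

```typescript
const SUPPORTED_PROBES: readonly string[] = ["web_fetch", "web_search"];

interface ProbeSelection {
  probes: string[];
  warnings: string[];
}

// Parses the comma-separated value of --probe-tools.
function parseProbeTools(flagValue: string | undefined): ProbeSelection {
  const probes: string[] = [];
  const warnings: string[] = [];
  if (!flagValue) {
    return { probes, warnings }; // passive by default: no probes selected
  }
  for (const raw of flagValue.split(",")) {
    const name = raw.trim();
    if (name === "") continue;
    if (!SUPPORTED_PROBES.includes(name)) {
      warnings.push(`unsupported probe: ${name}`); // warn, do not fail the run
      continue;
    }
    if (!probes.includes(name)) probes.push(name);
  }
  return { probes, warnings };
}

// Mirrors only the config paths named above; the type itself is an assumption.
interface ProbeConfig {
  tools?: { web?: { search?: { provider?: string } } };
  plugins?: { entries?: { tavily?: { config?: { webSearch?: { apiKey?: string } } } } };
}

// The web_search probe applies only when Tavily is selected and a key is available.
function canProbeWebSearch(
  cfg: ProbeConfig,
  env: Record<string, string | undefined>,
): boolean {
  const provider = cfg.tools?.web?.search?.provider;
  const key = cfg.plugins?.entries?.tavily?.config?.webSearch?.apiKey || env.TAVILY_API_KEY;
  return provider === "tavily" && Boolean(key);
}
```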

Freshness windows

Without a freshness window, historical provider/tool failures can pollute smoke results long after recovery. Runtime-state checks therefore treat a non-healthy tool record as current only when:

  • it has a failure timestamp inside --failure-freshness-minutes (default 60), and
  • there is no newer lastSuccessAt timestamp for that tool.

If all non-healthy records are stale, smoke reports the runtime-state checks as OK and notes the stale records in the check details.
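
The two freshness conditions can be sketched as a single predicate. lastSuccessAt and the 60-minute default come from the text above; the ToolRecord shape, the lastFailureAt field name, and the ISO 8601 timestamp format are assumptions made for illustration.

```typescript
// Hypothetical record shape; only lastSuccessAt is named in the README.
interface ToolRecord {
  tool: string;
  healthy: boolean;
  lastFailureAt?: string; // assumed field name, ISO 8601
  lastSuccessAt?: string; // ISO 8601
}

// True only for a non-healthy record that failed inside the window
// and has no newer success.
function isCurrentFailure(
  record: ToolRecord,
  now: Date,
  freshnessMinutes: number = 60,
): boolean {
  if (record.healthy || !record.lastFailureAt) return false;

  const failedAt = Date.parse(record.lastFailureAt);
  if (Number.isNaN(failedAt)) return false;

  const ageMs = now.getTime() - failedAt;
  if (ageMs > freshnessMinutes * 60_000) return false; // older than the window

  if (record.lastSuccessAt) {
    const succeededAt = Date.parse(record.lastSuccessAt);
    if (!Number.isNaN(succeededAt) && succeededAt > failedAt) return false; // recovered since
  }
  return true;
}
```

Under these assumptions, a record that fails either condition would be reported as tool_failure_stale rather than tool_degraded.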

Current status on this host

Last verified 2026-05-13 16:22 Asia/Shanghai:

npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
Summary: ok=11 warn=2 fail=0

Both remaining warnings were runtime_health_sampling_noise: the 15s Gateway CPU average was low, and the latest liveness/long-running diagnostics were model-call/session-queue related. Treat the P0 runtime stability incident as recovered unless future smoke runs show sustained high CPU, repeated non-model-call liveness warnings, plugin errors, provider failures, or channel loops.

Current known remediation candidates

These are intentionally not automatic:

  1. If runtime_degraded returns outside active model calls, investigate active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file descriptors before restarting Gateway.
  2. Add freshness/active-probe semantics for historical runtime-state tool failures.
  3. Keep web search provider checks aligned with current config paths (tools.web.search first, legacy web.search fallback).
  4. Track security-hardening warnings separately from runtime reliability: exec security, Control UI auth posture, trusted proxies, and plugin pinning are not P0 stability failures.

Mainline closeout

openclaw-reliability is now in a stable, usable state rather than an active incident-response state.

Latest validation should be generated locally with:

npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run report -- --cpu-sample-seconds=15 --failure-freshness-minutes=60

Interpretation:

  • fail=0 means the system is usable for diagnostics.
  • runtime_health_sampling_noise is usually acceptable when sustained CPU is low and diagnostics are model-call/session-queue related.
  • Fresh runtime_degraded, plugin errors, provider failures, or channel loops should be investigated before changing runtime behavior.
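
For quick scripting against the text output, a summary line in the form shown in this README (Summary: ok=11 warn=2 fail=0) could be parsed as sketched below. The function names are hypothetical and the regular expression assumes exactly that line format. For anything beyond a quick check, npm run smoke:json is likely the better input.

```typescript
interface SmokeSummary {
  ok: number;
  warn: number;
  fail: number;
}

// Parses a line such as "Summary: ok=11 warn=2 fail=0"; returns null on mismatch.
function parseSummary(line: string): SmokeSummary | null {
  const match = /^Summary:\s*ok=(\d+)\s+warn=(\d+)\s+fail=(\d+)$/.exec(line.trim());
  if (!match) return null;
  return {
    ok: Number(match[1]),
    warn: Number(match[2]),
    fail: Number(match[3]),
  };
}

// Per the interpretation above, fail=0 means the system is usable for diagnostics.
function usableForDiagnostics(summary: SmokeSummary): boolean {
  return summary.fail === 0;
}
```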

Resolved during this mainline:

  • failing remote embedding/provider paths are classified as provider/tool issues rather than global runtime failure
  • session-expired channel noise is detected separately
  • noisy cron delivery failures are detected as operational noise rather than plugin packaging failure
  • stale source-only extension shadows are detected
  • skills pressure is summarized
  • language-boundary runtime state and audit summary are included when available

Reusability requirements

  • No hard-coded user paths such as /Users/<name> or machine-specific workspace paths.
  • No dependency on a specific agent/session/machine name.
  • No dependency on a specific channel, provider account, local credential, cron job, or local extension layout.
  • Host-specific paths or providers may appear only as examples or local validation notes, never runtime defaults.
  • All future state paths must be configurable or derived from the OpenClaw or user home directory.
  • External probes must stay opt-in and must not assume any specific provider, channel, or local credential exists.
  • Default behavior must remain read-only.