Agent Cost Eval Kit

Quickly check whether an agent looks unusually expensive, then decide whether to ignore, watch, investigate one path, run a deeper audit, or evaluate a confirmed change.

Install

openclaw skills install agent-cost-eval-kit

Fallback URL install:

openclaw skills install https://clawhub.ai/choosenobody/agent-cost-eval-kit

Install for all local agents:

openclaw skills install agent-cost-eval-kit --global

Force update:

openclaw skills install agent-cost-eval-kit --global --force

Activation

Primary:

eval agent cost change

Also triggers:

  • I suspect [agent] got more expensive
  • quick check agent cost for [agent]
  • eval cost change after [action]
  • quick check agent cost

Quick Start

You do not need to know what changed.

Start with one sentence:

I suspect My_Agent got more expensive

or

quick check agent cost for My_Agent

The skill will first return:

  • Status
  • Plain-English conclusion
  • Do now
  • Don't do now

Full before/after evaluation is optional and only used when comparable evidence exists.

What You Get

  • A quick status: No Action Needed / Watch / Investigate One Path / Run Routing Audit / Unsafe to Judge
  • A plain-English conclusion
  • One action to take now
  • One action to skip for now

Keep / Revert / Narrow decisions are given only when you provide comparable before/after evidence.

Status Labels

Status | When to use
No Action Needed | No meaningful cost anomaly is visible from the provided evidence
Watch | Possible cost increase, not enough evidence to act. Observe one path only.
Investigate One Path | Suspicious pattern in one agent/kind/task path. Inspect that path only.
Run Routing Audit | Evidence suggests possible model/routing/retry/fallback issue. Recommend audit agent routing waste.
Unsafe to Judge | High-risk workflow or missing quality/safety evidence prevents a safe conclusion.

Quick Check mode must always produce one of the five statuses above. "Not Comparable Yet" is not a valid Quick Check status.
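The five-status rule can be sketched as a small validator. This is an illustration only, not part of the skill: the names QuickCheckStatus and parse_quick_check_status are hypothetical, and only the five labels come from the table above.

```python
from enum import Enum


class QuickCheckStatus(Enum):
    """The five labels allowed in Quick Check mode (hypothetical helper type)."""

    NO_ACTION_NEEDED = "No Action Needed"
    WATCH = "Watch"
    INVESTIGATE_ONE_PATH = "Investigate One Path"
    RUN_ROUTING_AUDIT = "Run Routing Audit"
    UNSAFE_TO_JUDGE = "Unsafe to Judge"


def parse_quick_check_status(label: str) -> QuickCheckStatus:
    """Return the matching status, or raise ValueError for any other label."""
    cleaned = label.strip().lower()
    for status in QuickCheckStatus:
        if status.value.lower() == cleaned:
            return status
    raise ValueError(f"not a valid Quick Check status: {label!r}")
```

With this shape, a label such as "Not Comparable Yet" is rejected rather than passed through silently.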

Use Keep / Revert / Narrow only when you have a confirmed cost-control change with comparable before/after evidence.

Quick Check Mode

When the user says:

I suspect My_Agent got more expensive

or

quick check agent cost for My_Agent

The skill:

  1. Treats the request as Quick Check mode
  2. Inspects or requests only the smallest set of existing evidence available
  3. Groups sessions by agent and by kind when the data allows
  4. Does not compare direct, cron, and subagent sessions as if they were the same workload
  5. Does not treat one high-token direct chat as proof of a cost increase
  6. Does not ask the user to run a fresh operational task
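Steps 3 to 5 above amount to bucketing sessions before comparing anything. A minimal sketch, assuming each session is a dict with "agent" and "kind" keys; that record shape is an assumption for illustration, not a documented schema, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Any

Session = dict[str, Any]


def group_sessions(sessions: list[Session]) -> dict[tuple[str, str], list[Session]]:
    """Bucket sessions by (agent, kind) so unlike workloads are never pooled."""
    groups: dict[tuple[str, str], list[Session]] = defaultdict(list)
    for session in sessions:
        # "agent" and "kind" are assumed field names for this sketch.
        groups[(session["agent"], session["kind"])].append(session)
    return dict(groups)


def single_sample_groups(groups: dict[tuple[str, str], list[Session]]) -> list[tuple[str, str]]:
    """List the groups that hold only one session; one sample is not proof."""
    return [key for key, members in groups.items() if len(members) == 1]
```

Any comparison then happens inside one bucket at a time, and a bucket with a single session is flagged instead of being read as a trend.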

Required output format for Quick Check mode:

Status: [one of the five statuses above]

Plain-English conclusion:
[2-4 sentences max]

Do now:
[one concrete action]

Don't do now:
[one thing to skip]

Quick Check mode must always end with a concrete action recommendation. It must not end with a generic request for more samples, more data, or more runs.
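The required format can be sketched as a small renderer that refuses to produce a report with an empty section. QuickCheckReport is a hypothetical name for this illustration; only the four section labels come from the format above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QuickCheckReport:
    """Hypothetical container for the four required Quick Check sections."""

    status: str
    conclusion: str
    do_now: str
    dont_do_now: str

    def __post_init__(self) -> None:
        # Every section is mandatory, including the concrete "Do now" action.
        for field in fields(self):
            if not getattr(self, field.name).strip():
                raise ValueError(f"{field.name} must not be empty")

    def render(self) -> str:
        """Return the report in the required plain-text layout."""
        return (
            f"Status: {self.status}\n\n"
            f"Plain-English conclusion:\n{self.conclusion}\n\n"
            f"Do now:\n{self.do_now}\n\n"
            f"Don't do now:\n{self.dont_do_now}"
        )
```

The sketch checks only that each section is present. Whether the action is concrete enough remains a judgment call.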

Mixed Sessions (direct + cron + subagent)

Do NOT use "Not Comparable Yet" as the final status.

For mixed sessions, default to Watch unless there is clear evidence of a cost anomaly.

Example expected output for mixed sessions:

Status: Watch

Plain-English conclusion:
There is no clear proof that My_Agent became more expensive. The data shows mixed
workloads (direct, cron, subagent), which are not directly comparable. One high-token
direct session is not enough to prove a routing regression.

Do now:
Watch only direct sessions for the next 24 hours.

Don't do now:
Do not change routing, switch models, or run new operational tasks just to create a baseline.

When cacheRead Dominates totalTokens

When the evidence shows:

  • Very high cacheRead relative to totalTokens
  • One or a few direct sessions with high totalTokens
  • No clean before/after baseline

Do this:

Status: Watch

Plain-English conclusion:
High totalTokens may not mean high incremental cost. When cacheRead is very high,
most tokens are from cached context, not new generation. Do not treat this as a
cost regression without more evidence.

Do now:
Watch only fresh-generation tokens (totalTokens minus cacheRead) over the next 24 hours.

Don't do now:
Do not run new tasks just to create baseline data. Prefer observing naturally occurring runs.

Always explain: high totalTokens with high cacheRead is not the same as high incremental cost.
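The arithmetic behind this rule is simple enough to sketch. It assumes, as the "totalTokens minus cacheRead" wording above implies, that cacheRead is counted inside totalTokens. The function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def non_cached_tokens(total_tokens: int, cache_read: int) -> int:
    """Tokens that were not served from cache: totalTokens minus cacheRead."""
    if cache_read < 0 or cache_read > total_tokens:
        raise ValueError("cacheRead must be between 0 and totalTokens")
    return total_tokens - cache_read


def cache_read_share(total_tokens: int, cache_read: int) -> float:
    """Fraction of totalTokens that came from cache reads (0.0 if no tokens)."""
    if total_tokens <= 0:
        return 0.0
    return cache_read / total_tokens
```

For an illustrative session with totalTokens of 200000 and cacheRead of 180000, cache_read_share is 0.9, so 90% of the total is cached context. Only 20000 tokens are non-cached, which is the figure worth watching.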

ops_cat-Style Case (Mixed direct/cron/subagent, high cacheRead)

For this specific pattern:

  • Mixed direct / cron / subagent sessions
  • One direct session has high totalTokens
  • cacheRead is very high
  • No clean before/after baseline

Expected output:

Status: Watch

Plain-English conclusion:
There is no clear proof that ops_cat became more expensive. The gpt-5.4 direct session
has high totalTokens, but most appears to be cacheRead, so it should not be treated as
a direct cost regression. Mixed direct / cron / subagent sessions are not comparable.

Do now:
Watch only ops_cat direct sessions for the next 24 hours.

Don't do now:
Do not change routing, switch models, or run new operational tasks just to create a baseline.

Copy-paste command (only if the user asks):

openclaw sessions --agent ops_cat --limit 20 --json

Full Eval Mode

Use this only after you have a confirmed change and comparable evidence.

Trigger:

eval cost change after reducing retries from 4 to 2

Provide:

Before:
<paste summary>

After:
<paste summary>

Output:

Decision: Keep Change / Revert Change / Narrow Change / Watch / Unsafe to Judge

Before / After:
Cost signal:
Quality / reliability signal:
Recommendation:
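One possible way to express the cost signal is the relative change in median per-run cost. The skill does not prescribe a formula, so this is only a sketch under that assumption, and median_change is a hypothetical helper name.

```python
from statistics import median


def median_change(before: list[float], after: list[float]) -> float:
    """Relative change of the median; -0.25 means 25% lower after the change."""
    if not before or not after:
        raise ValueError("both sides need at least one run")
    baseline = median(before)
    if baseline == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (median(after) - baseline) / baseline
```

The median is used here because one unusually expensive run moves it less than it would move the mean. The result still says nothing about quality or reliability, which the output reports separately.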

Conditions to enter Full Eval Mode:

  • Same agent
  • Same kind: direct / cron / subagent (do not mix these)
  • Same task type
  • At least 3–5 runs on each side
  • Cost/token data present
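The entry conditions above can be sketched as a checklist function that returns the reasons a before/after pair is not yet comparable. The field names ("agent", "kind", "task_type", "totalTokens", "cost") and the function name are assumptions for illustration. The default minimum of 3 is the lower end of the 3–5 range.

```python
from typing import Any

Run = dict[str, Any]


def full_eval_blockers(before: list[Run], after: list[Run], min_runs: int = 3) -> list[str]:
    """Return the reasons Full Eval should not start; an empty list means go."""
    blockers: list[str] = []
    runs = before + after

    # Same agent, same kind, same task type across both sides.
    for field, label in (("agent", "agent"), ("kind", "kind"), ("task_type", "task type")):
        values = {run.get(field) for run in runs}
        if len(values) != 1 or None in values:
            blockers.append(f"runs do not share one {label}")

    if len(before) < min_runs or len(after) < min_runs:
        blockers.append(
            f"need at least {min_runs} runs on each side "
            f"(got {len(before)} before, {len(after)} after)"
        )

    # Each run needs at least one cost or token figure.
    if any(run.get("totalTokens") is None and run.get("cost") is None for run in runs):
        blockers.append("some runs lack cost/token data")

    return blockers
```

If the list is non-empty, the request stays in Quick Check mode and gets one of the five statuses.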

Hard Rules

  1. Quick Check mode must always end with a concrete action recommendation. It must not end with a generic request for 3–5 more samples or more runs.

  2. By default, do not recommend running new operational work just to manufacture eval data. Prefer observing naturally occurring future runs. Recommend creating new evidence only if the user explicitly asks for baseline collection.

  3. "Not Comparable Yet" is not a valid Quick Check status. Use one of the five defined statuses.

  4. When cacheRead dominates totalTokens, explain clearly that high totalTokens may not mean high incremental cost. Do not treat high-cache situations as cost regressions without more evidence.

  5. Keep output short in Quick Check mode. No long tables unless they are already available and genuinely helpful.

Safety Boundaries

This skill is read-only.

It will not:

  • edit configs
  • switch models or providers
  • disable jobs
  • run new operational tasks just to create evidence unless explicitly requested
  • treat mixed workloads as proof
  • treat one sample as proof
  • promise lower cost with equal quality
  • require secrets, API keys, private keys, credentials, or full private logs

Users should redact sensitive data before pasting.
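A rough redaction pass before pasting might look like the sketch below. The patterns are illustrative assumptions and will not catch every secret format, so a manual review is still needed. The redact function is hypothetical and not something the skill provides.

```python
import re

# Each pattern is an assumption about common secret shapes, not a complete list.
_PATTERNS: list[tuple[re.Pattern[str], str]] = [
    # PEM-style private key blocks.
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----", re.DOTALL),
     "[REDACTED PRIVATE KEY]"),
    # HTTP bearer tokens.
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # key=value or key: value pairs with sensitive-looking names.
    (re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+"), r"\1\2[REDACTED]"),
    # Bare keys that start with "sk-".
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"), "[REDACTED]"),
]


def redact(text: str) -> str:
    """Replace likely secrets with placeholders, leaving token counts intact."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Fields such as totalTokens and cacheRead are left untouched, since they are the evidence the skill needs.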