Agent Cost Eval Kit

Quickly check whether an agent looks unusually expensive, then evaluate confirmed cost-control changes only when comparable evidence exists.

Quickly check whether an agent looks unusually expensive, then decide whether to ignore, watch, investigate one path, run a deeper audit, or evaluate a confirmed change.

Install

openclaw skills install agent-cost-eval-kit

Fallback URL install:

openclaw skills install https://clawhub.ai/choosenobody/agent-cost-eval-kit

Install for all local agents:

openclaw skills install agent-cost-eval-kit --global

Force update:

openclaw skills install agent-cost-eval-kit --global --force

Activation

Primary:

eval agent cost change

Also triggers:

  • I suspect [agent] got more expensive
  • quick check agent cost
  • eval cost change after [action]

Quick Start

You do not need to know what changed.

Start with one sentence:

I suspect My_Agent got more expensive

or

quick check agent cost for My_Agent

The skill will first return:

  • Status
  • Likely interpretation
  • Do this now
  • Copy-paste command or request

Full before/after evaluation is optional and only used when comparable evidence exists.

What You Get

  • A quick status: No Action Needed / Watch / Investigate One Path / Run Routing Audit / Unsafe to Judge
  • A plain-English interpretation
  • One next action
  • A short copy-paste command or request

A Keep / Revert / Narrow decision is returned only when you provide comparable before/after evidence.

New Status Labels

Status                 When to use
No Action Needed       No meaningful cost anomaly is visible from the provided evidence.
Watch                  Possible cost increase, not enough evidence to act. Observe one path only.
Investigate One Path   Suspicious pattern in one agent/kind/task path. Inspect that path only.
Run Routing Audit      Evidence suggests possible model/routing/retry/fallback issue. Recommend: audit agent routing waste.
Unsafe to Judge        High-risk workflow or missing quality/safety evidence prevents a safe conclusion.

Use Keep / Revert / Narrow only when you have a confirmed cost-control change with comparable before/after evidence.

Quick Check Mode

When the user says:

I suspect My_Agent got more expensive

The skill:

  1. Treats this as Quick Check mode
  2. Inspects or requests only the smallest available existing evidence
  3. Groups by same agent and same kind if data exists
  4. Avoids comparing direct / cron / subagent as the same workload
  5. Avoids treating one high-token direct chat as proof
  6. Does not ask the user to run a fresh operational task

Preferred output:

Status: Watch

Likely interpretation:
One high-token direct chat is not enough to prove My_Agent became more expensive.
It may be a long conversation or accumulated session token count, not a routing regression.

Do this now:
Do not change routing yet. Check only recent direct sessions for My_Agent.

Copy-paste:
openclaw sessions --agent My_Agent --kind direct --limit 10 --json

If --kind is not supported, use:
openclaw sessions --agent My_Agent --limit 20 --json
and group by kind manually.
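
Grouping by kind manually can be done with a few lines of Python. The sketch below is only an illustration: it assumes the JSON output is a list of session objects with a "kind" field and a "totalTokens" field. Both field names are assumptions, not confirmed CLI output, so adjust them to whatever your data actually contains.

```python
from collections import defaultdict
from statistics import median

def summarize_by_kind(sessions):
    """Group session rows by kind and summarize token counts per group."""
    groups = defaultdict(list)
    for row in sessions:
        # "kind" and "totalTokens" are assumed field names (hypothetical).
        groups[row.get("kind", "unknown")].append(row.get("totalTokens", 0))
    return {
        kind: {"runs": len(tokens), "median": median(tokens), "max": max(tokens)}
        for kind, tokens in groups.items()
    }

# Made-up sample rows standing in for parsed JSON output.
sample = [
    {"kind": "direct", "totalTokens": 1200},
    {"kind": "direct", "totalTokens": 90000},
    {"kind": "cron", "totalTokens": 800},
    {"kind": "direct", "totalTokens": 1500},
]

for kind, stats in summarize_by_kind(sample).items():
    print(kind, stats)
```

Reporting the median next to the maximum makes it easier to see when a single long conversation, rather than the typical session, accounts for a high number.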

Mixed Sessions (direct + cron + subagent)

Do NOT say only "Not Comparable Yet."

Use:

Status: Watch

Likely interpretation:
The data shows mixed workloads, not a clean cost regression.
The high direct-chat token count may be caused by a long active conversation.
No routing change should be made from this evidence alone.

Do this now:
Compare only direct sessions first.

Copy-paste:
openclaw sessions --agent My_Agent --limit 20 --json
Then select only rows where kind = direct.
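
One way to select only the direct rows and see whether a single session dominates them is sketched below. This is not the skill's own logic. The field names "kind" and "totalTokens" and the factor of 5 are all hypothetical choices made for illustration.

```python
from statistics import median

def direct_only(sessions):
    """Keep only rows whose kind is 'direct' (assumed field name)."""
    return [row for row in sessions if row.get("kind") == "direct"]

def lone_outliers(sessions, factor=5.0):
    """Return direct sessions whose token count exceeds factor x the direct median.

    With fewer than 3 direct rows the median says little, so nothing is flagged.
    """
    direct = direct_only(sessions)
    if len(direct) < 3:
        return []
    typical = median(row.get("totalTokens", 0) for row in direct)
    return [row for row in direct if row.get("totalTokens", 0) > factor * typical]

# Made-up mixed workload: the large cron row must not influence the result.
mixed = [
    {"kind": "direct", "totalTokens": 1200},
    {"kind": "cron", "totalTokens": 500000},
    {"kind": "direct", "totalTokens": 90000},
    {"kind": "subagent", "totalTokens": 300},
    {"kind": "direct", "totalTokens": 1500},
]

for row in lone_outliers(mixed):
    print("single high-token direct session:", row["totalTokens"])
```

If exactly one direct session is flagged while the rest sit near the median, that is consistent with a long active conversation rather than a routing regression, which matches the Watch status above.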

Full Eval Mode

Use this only after you have a confirmed change and comparable evidence.

Trigger:

eval cost change after reducing retries from 4 to 2

Provide:

Before:
<paste summary>

After:
<paste summary>

Output:

Decision: Keep Change / Revert Change / Narrow Change / Watch / Unsafe to Judge

Before / After:
Cost signal:
Quality / reliability signal:
Recommendation:

Conditions to enter Full Eval Mode:

  • Same agent
  • Same kind: direct / cron / subagent (do not mix these)
  • Same task type
  • At least 3–5 runs on each side (before and after)
  • Cost/token data present
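
The conditions above can be checked mechanically before pasting evidence. The following is a minimal sketch, assuming each run is summarized as a small record. The Run class, its field names, and the choice of 3 as the minimum (the lower bound of the 3–5 guideline) are illustrative assumptions, not part of the skill.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    agent: str
    kind: str                      # "direct", "cron", or "subagent"
    task_type: str
    tokens: Optional[int] = None   # None means cost/token data is missing

def comparability_problems(before, after, min_runs=3):
    """Return a list of reasons the two sides are not comparable (empty if fine)."""
    problems = []
    for label, side in (("before", before), ("after", after)):
        if len(side) < min_runs:
            problems.append(f"{label}: only {len(side)} runs, need at least {min_runs}")
    runs = before + after
    # Agent, kind, and task type must each be identical across every run.
    for field in ("agent", "kind", "task_type"):
        values = {getattr(run, field) for run in runs}
        if len(values) > 1:
            problems.append(f"mixed {field}: {sorted(values)}")
    if any(run.tokens is None for run in runs):
        problems.append("missing cost/token data")
    return problems
```

An empty list means the evidence meets the listed conditions. Any entry in the list is a reason to stay in Quick Check mode instead.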

Safety Boundaries

This skill is read-only.

It will not:

  • edit configs
  • switch models or providers
  • disable jobs
  • run new operational tasks just to create evidence unless explicitly requested
  • treat mixed workloads as proof
  • treat one sample as proof
  • promise lower cost with equal quality
  • require secrets, API keys, private keys, credentials, or full private logs

Users should redact sensitive data before pasting.
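
A rough way to redact before pasting is sketched below. The patterns are hypothetical examples that catch only simple "key=value" style secrets and Bearer tokens. They are not exhaustive, so review the output by eye before sharing it.

```python
import re

# Hypothetical patterns; extend them for the secret formats you actually use.
KEY_VALUE = re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[:=]\s*)\S+")
BEARER = re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._\-]+")

def redact(text):
    """Replace values that follow common secret labels with a placeholder."""
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    text = BEARER.sub("Bearer [REDACTED]", text)
    return text

print(redact("api_key=abc123"))
print(redact("Authorization: Bearer abc.def-ghi"))
```

Plain counters such as "tokens: 500" are left untouched because the pattern matches "token" only as a whole word.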