Install
Agent Cost Eval Kit: Quickly check whether an agent looks unusually expensive, then decide whether to ignore, watch, investigate one path, run a deeper audit, or evaluate a confirmed change. Confirmed cost-control changes are evaluated only when comparable evidence exists.
openclaw skills install agent-cost-eval-kit
Fallback URL install:
openclaw skills install https://clawhub.ai/choosenobody/agent-cost-eval-kit
Install for all local agents:
openclaw skills install agent-cost-eval-kit --global
Force update:
openclaw skills install agent-cost-eval-kit --global --force
Primary:
eval agent cost change
Also triggers:
You do not need to know what changed.
Start with one sentence:
I suspect My_Agent got more expensive
or
quick check agent cost for My_Agent
The skill will first return:
Full before/after evaluation is optional and only used when comparable evidence exists.
Keep / Revert / Narrow decisions are returned only when you provide comparable before/after evidence.
| Status | When to use |
|---|---|
| No Action Needed | No meaningful cost anomaly is visible from the provided evidence |
| Watch | Possible cost increase, not enough evidence to act. Observe one path only. |
| Investigate One Path | Suspicious pattern in one agent/kind/task path. Inspect that path only. |
| Run Routing Audit | Evidence suggests a possible model/routing/retry/fallback issue. Recommend running "audit agent routing waste". |
| Unsafe to Judge | High-risk workflow or missing quality/safety evidence prevents a safe conclusion. |
Quick Check mode must always produce one of the five statuses above. "Not Comparable Yet" is not a valid Quick Check status.
Use Keep / Revert / Narrow only when you have a confirmed cost-control change with comparable before/after evidence.
When the user says:
I suspect My_Agent got more expensive
or
quick check agent cost for My_Agent
The skill:
Required output format for Quick Check mode:
Status: [one of the five statuses above]
Plain-English conclusion:
[2-4 sentences max]
Do now:
[one concrete action]
Don't do now:
[one thing to skip]
Quick Check mode must always end with a concrete action recommendation. It must not end with a generic request for more samples, more data, or more runs.
Do NOT use "Not Comparable Yet" as the final status.
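The format rules above can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the function name `check_quick_check_output` is hypothetical, and it only verifies that the four labels are present and that the status is one of the five defined values.

```python
# Hypothetical helper, not shipped with the skill: checks a Quick Check
# response against the required output format described above.

# The five statuses defined in the status table.
VALID_STATUSES = {
    "No Action Needed",
    "Watch",
    "Investigate One Path",
    "Run Routing Audit",
    "Unsafe to Judge",
}

# Section labels from the required output format.
REQUIRED_LABELS = ("Status:", "Plain-English conclusion:", "Do now:", "Don't do now:")


def check_quick_check_output(text: str) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    problems = []
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]

    # Every required label must start some line.
    for label in REQUIRED_LABELS:
        if not any(line.startswith(label) for line in lines):
            problems.append(f"missing section: {label}")

    # The status must be one of the five defined values.
    status_lines = [line for line in lines if line.startswith("Status:")]
    if status_lines:
        status = status_lines[0][len("Status:"):].strip()
        if status not in VALID_STATUSES:
            problems.append(f"invalid status: {status!r}")

    return problems
```

A response with "Status: Not Comparable Yet" would come back with an "invalid status" problem, which matches the rule that this value is never a valid final status.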
For mixed sessions, default to Watch unless there is clear evidence of a cost anomaly.
Example expected output for mixed sessions:
Status: Watch
Plain-English conclusion:
There is no clear proof that My_Agent became more expensive. The data shows mixed
workloads (direct, cron, subagent), which are not directly comparable. One high-token
direct session is not enough to prove a routing regression.
Do now:
Watch only direct sessions for the next 24 hours.
Don't do now:
Do not change routing, switch models, or run new operational tasks just to create a baseline.
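One way to see why mixed workloads are not directly comparable is to summarize each session kind separately rather than together. The sketch below is illustrative only: the record fields `kind` and `totalTokens` are assumed for the example and are not a documented session schema, and the token counts are invented.

```python
from collections import defaultdict
from statistics import median


def median_tokens_by_kind(sessions):
    """Group sessions by kind and return the median totalTokens per kind.

    Each session is assumed to be a dict with hypothetical keys
    "kind" (e.g. "direct", "cron", "subagent") and "totalTokens".
    """
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["kind"]].append(session["totalTokens"])
    return {kind: median(values) for kind, values in grouped.items()}


# Invented example data: compare like with like, never across kinds.
example_sessions = [
    {"kind": "direct", "totalTokens": 1000},
    {"kind": "direct", "totalTokens": 3000},
    {"kind": "cron", "totalTokens": 200},
]
per_kind = median_tokens_by_kind(example_sessions)
```

Comparing a direct session only against other direct sessions avoids reading a workload difference as a cost regression.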
When the evidence shows:
Do this:
Status: Watch
Plain-English conclusion:
High totalTokens may not mean high incremental cost. When cacheRead is very high,
most tokens are from cached context, not new generation. Do not treat this as a
cost regression without more evidence.
Do now:
Watch only fresh-generation tokens (totalTokens minus cacheRead) over the next 24 hours.
Don't do now:
Do not run new tasks just to create baseline data. Prefer observing naturally occurring runs.
Always explain: high totalTokens with high cacheRead is not the same as high incremental cost.
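The fresh-generation figure mentioned above is plain subtraction. A minimal sketch, assuming `totalTokens` and `cacheRead` are available as integers for a session; the function names and the example numbers are hypothetical.

```python
def fresh_tokens(total_tokens: int, cache_read: int) -> int:
    """Tokens that were not served from cache (totalTokens minus cacheRead)."""
    # Clamp at zero in case the two counters are reported inconsistently.
    return max(total_tokens - cache_read, 0)


def cache_share(total_tokens: int, cache_read: int) -> float:
    """Fraction of totalTokens that came from cached context."""
    if total_tokens <= 0:
        return 0.0
    return cache_read / total_tokens


# Invented numbers: a session that looks large but is mostly cached context.
total, cached = 120_000, 108_000
fresh = fresh_tokens(total, cached)   # 12_000 fresh tokens
share = cache_share(total, cached)    # 0.9, i.e. 90% of tokens were cache reads
```

In this invented case the session looks like 120,000 tokens, but only 12,000 are new, which is why a high totalTokens value alone should not be read as a cost regression.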
For this specific pattern:
Expected output:
Status: Watch
Plain-English conclusion:
There is no clear proof that ops_cat became more expensive. The gpt-5.4 direct session
has high totalTokens, but most appears to be cacheRead, so it should not be treated as
a direct cost regression. Mixed direct / cron / subagent sessions are not comparable.
Do now:
Watch only ops_cat direct sessions for the next 24 hours.
Don't do now:
Do not change routing, switch models, or run new operational tasks just to create a baseline.
Copy-paste command (only if the user asks):
openclaw sessions --agent ops_cat --limit 20 --json
Use this only after you have a confirmed change and comparable evidence.
Trigger:
eval cost change after reducing retries from 4 to 2
Provide:
Before:
<paste summary>
After:
<paste summary>
Output:
Decision: Keep Change / Revert Change / Narrow Change / Watch / Unsafe to Judge
Before / After:
Cost signal:
Quality / reliability signal:
Recommendation:
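The "Cost signal" line in this output is typically a before/after comparison. As a rough sketch of that arithmetic only, assuming both summaries report the same cost metric over comparable runs; the function name `pct_change` and the numbers are hypothetical, and the skill's own decision rule is not specified here.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after; negative means cost went down."""
    if before == 0:
        raise ValueError("before must be non-zero to compute a percentage change")
    return (after - before) / before * 100


# Invented numbers: average tokens per comparable run, before and after a change.
cost_signal = pct_change(200, 150)  # -25.0, i.e. a 25% reduction
```

A cost reduction alone does not justify Keep Change; the quality / reliability signal still has to hold up.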
Conditions to enter Full Eval Mode:
Quick Check mode must always end with a concrete action recommendation. It must not end with a generic request for 3–5 more samples or more runs.
By default, do not recommend running new operational work just to manufacture eval data. Prefer observing naturally occurring future runs. Recommend creating new evidence only if the user explicitly asks for baseline collection.
"Not Comparable Yet" is not a valid Quick Check status. Use one of the five defined statuses.
When cacheRead dominates totalTokens, explain clearly that high totalTokens may not mean high incremental cost. Do not treat high-cache situations as cost regressions without more evidence.
Keep output short in Quick Check mode. No long tables unless they are already available and genuinely helpful.
This skill is read-only.
It will not:
Users should redact sensitive data before pasting.