Install
openclaw skills install flow-test
Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters more than literal equality.
Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.
This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.
Invoke this skill when:
Do not use this skill when:
Turn a fuzzy requirement into a test design that combines:
pass, fail, or needs_review
Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:
Prefer questions like:
Avoid requiring one exact string unless the wording itself is the requirement.
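One way to read this advice is to check a property of the output instead of its exact wording. The sketch below is only an illustration under stated assumptions: `mentions_date` and the two date formats it recognizes are hypothetical, and an agent evaluator would be expected to handle phrasings this simple check misses.

```python
from datetime import date

# English month names, hardcoded so the check does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def mentions_date(text: str, day: date) -> bool:
    """Return True if text contains `day` as 'YYYY-MM-DD' or 'Month D, YYYY'.

    Hypothetical helper: it asks "does the output refer to this date?"
    rather than "is the output exactly this string?".
    """
    iso_form = day.isoformat()
    long_form = f"{MONTHS[day.month - 1]} {day.day}, {day.year}"
    return iso_form in text or long_form in text
```

A check like `mentions_date(page_text, date.today())` tolerates changes in headline wording, whereas an equality assertion on the full text would break on every run.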
Ask the execution flow to print or capture concise evidence such as:
The evaluator should be able to inspect why a verdict was reached.
Never rely on vague instructions such as "judge whether it looks good."
Always define:
needs_review
If evidence is incomplete, contradictory, or too weak, do not force a pass.
Return needs_review.
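The rule above can be sketched as a small verdict function. This is a minimal illustration, not part of the skill itself: the `Evidence` fields, the `decide_verdict` name, and the `min_items` threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    """Hypothetical evidence bundle; field names are illustrative only."""
    items: List[str] = field(default_factory=list)  # concise captured observations
    contradictory: bool = False                     # True when observations conflict
    rubric_met: Optional[bool] = None               # None: the rubric could not be judged


def decide_verdict(evidence: Evidence, min_items: int = 1) -> str:
    """Return 'pass', 'fail', or 'needs_review' without forcing a pass."""
    # Too little evidence: escalate instead of guessing.
    if len(evidence.items) < min_items:
        return "needs_review"
    # Conflicting or undecidable evidence is also escalated.
    if evidence.contradictory or evidence.rubric_met is None:
        return "needs_review"
    return "pass" if evidence.rubric_met else "fail"
```

The ordering is deliberate: every weak-evidence condition is checked before the rubric result is consulted, so a positive rubric judgment cannot override missing or contradictory evidence.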
When invoked, design the test in the following order.
Classify the task:
Then explain why literal equality or fixed snapshots are not sufficient.
Write two groups:
Use exact validation for stable parts, such as:
Use agent evaluation for variable parts, such as:
Specify exactly what the run should log or output.
Recommended evidence fields:
Keep evidence minimal but sufficient for review.
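As a rough sketch of what "minimal but sufficient" logging could look like, the example below writes one short JSON line per observation. The `EvidenceRecord` fields (`step`, `observation`, `source`) and the `max_chars` limit are assumptions chosen for illustration, not fields prescribed by this skill.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvidenceRecord:
    """One logged observation; all field names here are hypothetical."""
    step: str         # what the agent was doing
    observation: str  # what it actually saw, kept short
    source: str       # where the observation came from (URL, tool name, ...)


def to_log_line(record: EvidenceRecord, max_chars: int = 200) -> str:
    """Serialize a record as one JSON line, truncating long observations."""
    data = asdict(record)
    if len(data["observation"]) > max_chars:
        # Keep the log reviewable: cut long text and mark the cut.
        data["observation"] = data["observation"][:max_chars] + "..."
    return json.dumps(data, sort_keys=True)
```

For example, `print(to_log_line(EvidenceRecord("open homepage", "top headline is dated today", "browser")))` emits a single line that an evaluator can read back with `json.loads`.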
Use this baseline:
Return the design in this format:
## Test Intent
## Why Exact Assert Fails
## Deterministic Checks
## Evidence To Collect
## Semantic Rubric
## Execution Notes
## Final Verdict Format
## Test Intent
- Validate that:
## Why Exact Assert Fails
- Dynamic factors:
- Why literal equality is brittle:
## Deterministic Checks
- Check 1:
- Check 2:
## Evidence To Collect
- Evidence 1:
- Evidence 2:
## Semantic Rubric
- Pass when:
- Fail when:
- Needs review when:
## Execution Notes
- Constraints:
- Allowed variance:
- Safety concerns:
## Final Verdict Format
- verdict: pass | fail | needs_review
- reason:
- evidence:
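The final verdict section of the template can be produced mechanically once a verdict is known. The following is a minimal sketch, assuming a hypothetical `format_verdict` helper and a semicolon-separated evidence list; only the three verdict values and the three field names come from the template above.

```python
from typing import List

# The three verdict values named in the template.
ALLOWED_VERDICTS = ("pass", "fail", "needs_review")


def format_verdict(verdict: str, reason: str, evidence: List[str]) -> str:
    """Render the verdict, reason, and evidence as template-style bullets."""
    if verdict not in ALLOWED_VERDICTS:
        # Reject anything outside the three documented values.
        raise ValueError(f"unknown verdict: {verdict!r}")
    return "\n".join(
        [
            f"- verdict: {verdict}",
            f"- reason: {reason}",
            f"- evidence: {'; '.join(evidence)}",
        ]
    )
```

Rejecting unknown values early keeps results such as "probably fine" out of reports, so every run ends in one of the three documented states.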
Task: verify that visiting a news site returns today's news rather than stale content.
Good test design:
needs_review when freshness cannot be established from the evidence
Bad test design:
assert returned_text == "Today's news is ..."
When using this skill:
When asked to design a flow test, provide: