flow test

v0.0.1

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters more.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for qipengguo/flow-test.

Prompt Preview: Install & Setup
Install the skill "flow test" (qipengguo/flow-test) from ClawHub.
Skill page: https://clawhub.ai/qipengguo/flow-test
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install flow-test

ClawHub CLI


npx clawhub@latest install flow-test
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description match the SKILL.md: the skill designs semantic flow tests and asks for evidence and rubrics. It requires no unexpected environment variables, binaries, or installs unrelated to test design.
Instruction Scope
The instructions are scoped to designing tests: splitting deterministic vs semantic checks, specifying evidence to collect, and defining rubrics. They do ask the executor to capture evidence (URLs, page titles, extracted items), which is appropriate for the stated purpose. Note: that evidence could include sensitive or private content if the agent runs against authenticated/private targets — the skill itself does not instruct reading local files or credentials.
Install Mechanism
No install spec and no code files — instruction-only, so nothing is written to disk or downloaded by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. There is no disproportionate request for secrets or external access in the manifest or instructions.
Persistence & Privilege
The `always` flag is false and the skill is user-invocable; its instructions do not request permanent presence or modifications to other skills or global agent settings.
Assessment
This skill appears coherent and safe as an instruction-only test designer, but keep in mind: the tests it designs will capture and log evidence (URLs, page content, extracted items), which can include sensitive or private data if run against authenticated services or user content. Only run these tests against public or authorized targets, avoid supplying unrelated credentials, and review collected evidence before storing or sharing it. If you plan to have the agent run tests autonomously against production systems, add safeguards (rate limits, access controls, and human review for 'needs_review' cases).

Like a lobster shell, security has layers — review code before you run it.

Latest version: vk97b82phpm2hgef3vzhnc54e1s83hddz
134 downloads · 1 star · 1 version · updated 1 month ago
v0.0.1 · MIT-0 license

Flow Test

Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.

This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.

Invoke this skill when:

  • the task depends on live or changing web content
  • the output can vary but still be correct
  • the workflow spans multiple model or tool steps
  • intermediate evidence matters more than one exact final string
  • you need to verify user intent was satisfied, not exact wording

Do not use this skill when:

  • the result is deterministic and easy to assert directly
  • a schema check, exact match, snapshot, or pure function test is enough
  • the requirement can be covered fully by normal unit or integration tests

Objective

Turn a fuzzy requirement into a test design that combines:

  • deterministic checks for stable invariants
  • evidence collection for dynamic execution
  • semantic evaluation for variable outcomes
  • a bounded verdict of pass, fail, or needs_review
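
To make the combination concrete, one way to picture a flow test is as a small data structure holding all four pieces; the names `Verdict` and `FlowTestDesign` below are illustrative sketches, not part of the skill.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs_review"

@dataclass
class FlowTestDesign:
    # Exact checks over stable invariants; each takes the run result.
    deterministic_checks: list[Callable[[dict], bool]] = field(default_factory=list)
    # Names of the evidence fields the run must capture.
    evidence_fields: list[str] = field(default_factory=list)
    # The explicit rubric the semantic evaluator will apply.
    rubric: str = ""
```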

Design Principles

1. Keep asserts where they still work

Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:

  • tool call success
  • required fields
  • minimum counts
  • status codes
  • domain restrictions
  • date or freshness constraints when machine-checkable
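
As a sketch, these stable invariants map directly onto ordinary asserts; the `result` shape here is a hypothetical run record, not anything the skill prescribes.

```python
def check_deterministic(result: dict) -> None:
    """Exact checks for the stable parts of a run (hypothetical result shape)."""
    assert result["tool_error"] is None                    # tool call success
    assert result["status_code"] == 200                    # status codes
    for key in ("source_url", "final_answer"):             # required fields
        assert key in result
    assert len(result["items"]) >= 3                       # minimum counts
    assert result["source_url"].startswith("https://example.com/")  # domain restriction
```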

2. Judge task completion, not exact phrasing

Prefer questions like:

  • did the agent reach the right source
  • did it gather relevant information
  • does the final answer satisfy the user request

Avoid requiring one exact string unless the wording itself is the requirement.

3. Require inspectable evidence

Ask the execution flow to print or capture concise evidence such as:

  • visited URL
  • page title
  • visible headings
  • extracted entities
  • timestamps or date clues
  • key tool outputs
  • final answer

The evaluator should be able to inspect why a verdict was reached.
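
A minimal pattern, assuming a plain dict, is to accumulate evidence as the flow runs and emit the whole record at the end so the evaluator can trace the verdict; all values below are illustrative.

```python
evidence: dict = {"task": "find today's top headline"}

# Record each step's key facts as the flow executes.
evidence["source_url"] = "https://news.example.com"          # visited URL
evidence["source_title"] = "Example News - Home"             # page title
evidence["extracted_items"] = ["Headline A", "Headline B"]   # extracted entities
evidence["freshness_signals"] = ["published 2 hours ago"]    # date clues
evidence["final_answer"] = "Today's top story is Headline A."

print(evidence)  # the evaluator can now inspect why a verdict was reached
```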

4. Use explicit semantic rubrics

Never rely on vague instructions such as "judge whether it looks good."

Always define:

  • what evidence is required
  • what counts as a pass
  • what clearly fails
  • when uncertainty should become needs_review
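
In practice that means writing the rubric down verbatim for the evaluator. The wording below is an assumption, loosely adapted from the news example later in this document, not a rubric the skill ships with.

```python
RUBRIC = """\
Required evidence: source_url, source_title, extracted_items,
freshness_signals, final_answer.

Pass when: the source is a relevant news site, at least one extracted
item supports the final answer, and freshness signals indicate
same-day content.

Fail when: the source is irrelevant, the answer contradicts the
extracted items, or the content is clearly stale.

Needs review when: freshness signals are missing or ambiguous,
or any required evidence field is absent.
"""
```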

5. Prefer bounded confidence

If evidence is incomplete, contradictory, or too weak, do not force a pass.

Return needs_review.
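
A sketch of that rule as a guard: downgrade to needs_review whenever required evidence is missing, instead of letting the evaluator guess. The helper name is hypothetical.

```python
def bounded_verdict(evidence: dict, required: list[str], judged: str) -> str:
    """Never force a pass on weak evidence (illustrative helper)."""
    missing = [f for f in required if not evidence.get(f)]
    if missing:
        return "needs_review"  # incomplete evidence can never pass
    return judged              # otherwise defer to the semantic judgment
```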

Workflow

When invoked, design the test in the following order.

1. Identify why exact assertions are brittle

Classify the task:

  • dynamic web browsing
  • search or retrieval
  • LLM generation
  • multi-tool orchestration
  • end-to-end user flow

Then explain why literal equality or fixed snapshots are not sufficient.

2. Split deterministic checks from semantic checks

Write two groups:

Deterministic Checks

Use exact validation for stable parts, such as:

  • tool returned successfully
  • required fields are present
  • minimum number of results exists
  • source domain matches expectation
  • response includes a valid date range

Semantic Checks

Use agent evaluation for variable parts, such as:

  • relevance to the requested topic
  • freshness of the retrieved content
  • whether the answer reflects the gathered evidence
  • whether the workflow actually satisfies the intended task
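
The semantic group ultimately reduces to one call into an evaluator model. A sketch, where `ask_judge` is a hypothetical stand-in for whatever LLM client you actually use:

```python
import json

def ask_judge(prompt: str) -> str:
    """Hypothetical LLM call; substitute your real evaluator client."""
    raise NotImplementedError

def semantic_check(evidence: dict, rubric: str) -> dict:
    prompt = (
        "Apply this rubric to the evidence. Reply with JSON: "
        '{"verdict": "pass|fail|needs_review", "reason": "..."}\n\n'
        f"Rubric:\n{rubric}\n\nEvidence:\n{json.dumps(evidence, indent=2)}"
    )
    return json.loads(ask_judge(prompt))
```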

3. Define the evidence schema

Specify exactly what the run should log or output.

Recommended evidence fields:

  • task
  • source_url
  • source_title
  • extracted_items
  • freshness_signals
  • intermediate_results
  • final_answer
  • evaluator_notes

Keep evidence minimal but sufficient for review.
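
The recommended fields translate directly into a schema; a TypedDict keeps it minimal (a sketch using the field names above, with plausible but assumed types).

```python
from typing import TypedDict

class Evidence(TypedDict, total=False):
    task: str
    source_url: str
    source_title: str
    extracted_items: list[str]
    freshness_signals: list[str]
    intermediate_results: list[str]
    final_answer: str
    evaluator_notes: str
```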

4. Define the verdict rubric

Use this baseline:

Pass

  • the agent reached a relevant source or completed the intended flow
  • collected evidence supports the conclusion
  • the final output is relevant and sufficiently current for the task
  • there is no major contradiction between evidence and answer

Fail

  • the agent failed to reach a relevant source or complete the flow
  • the result is clearly irrelevant, stale, or fabricated
  • the output contradicts the evidence
  • the workflow misses a required user objective

Needs Review

  • evidence is partial or ambiguous
  • freshness cannot be determined confidently
  • multiple interpretations remain plausible

5. Produce a structured test spec

Return the design in this format:

## Test Intent

## Why Exact Assert Fails

## Deterministic Checks

## Evidence To Collect

## Semantic Rubric

## Execution Notes

## Final Verdict Format

Output Template

## Test Intent
- Validate that:

## Why Exact Assert Fails
- Dynamic factors:
- Why literal equality is brittle:

## Deterministic Checks
- Check 1:
- Check 2:

## Evidence To Collect
- Evidence 1:
- Evidence 2:

## Semantic Rubric
- Pass when:
- Fail when:
- Needs review when:

## Execution Notes
- Constraints:
- Allowed variance:
- Safety concerns:

## Final Verdict Format
- verdict: pass | fail | needs_review
- reason:
- evidence:
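
Rendered as data, a verdict in this format might look like the following; every value is illustrative.

```python
verdict = {
    "verdict": "needs_review",
    "reason": "Headlines were relevant, but no publish dates were visible.",
    "evidence": {
        "source_url": "https://news.example.com",
        "extracted_items": ["Headline A", "Headline B"],
    },
}
```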

Example

Task: verify that visiting a news site returns today's news rather than stale content.

Good test design:

  • deterministic checks confirm the page loads and at least one article item is collected
  • evidence includes the visited site, page title, visible headlines, date clues, and final summary
  • semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
  • semantic rubric fails when headlines are outdated, unrelated, or invented
  • semantic rubric returns needs_review when freshness cannot be established from the evidence

Bad test design:

  • assert returned_text == "Today's news is ..."
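
For contrast, here is a compact sketch of the good design as a pytest test, reusing the `RUBRIC`, `semantic_check`, and `bounded_verdict` sketches from earlier sections; `browse` is a hypothetical helper, not a real API.

```python
import pytest

def browse(url: str) -> dict:
    """Hypothetical browsing helper returning page facts for the test."""
    raise NotImplementedError

def test_todays_news_flow():
    result = browse("https://news.example.com")

    # Deterministic checks: the page loads and at least one article is collected.
    assert result["status_code"] == 200
    assert len(result["headlines"]) >= 1

    # Evidence for the evaluator.
    evidence = {
        "task": "fetch today's news",
        "source_url": result["url"],
        "source_title": result["title"],
        "extracted_items": result["headlines"],
        "freshness_signals": result["date_clues"],
        "final_answer": result["summary"],
    }

    # Semantic verdict, bounded by evidence completeness.
    judged = semantic_check(evidence, RUBRIC)["verdict"]
    verdict = bounded_verdict(evidence, list(evidence), judged)
    if verdict == "needs_review":
        pytest.skip("evidence inconclusive; route to human review")
    assert verdict == "pass"
```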

Guidance

When using this skill:

  • keep traditional asserts for stable invariants
  • use semantic evaluation only where exact matching becomes brittle
  • prefer narrow rubrics over subjective judgment
  • require visible evidence before passing the test
  • state uncertainty explicitly instead of masking it

Deliverables

When asked to design a flow test, provide:

  • a structured test spec
  • deterministic checks
  • an evidence schema
  • a semantic rubric
  • a final verdict format
