flow test

v0.0.1

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters more.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for qipengguo/flow-test.

Prompt Preview: Install & Setup
Install the skill "flow test" (qipengguo/flow-test) from ClawHub.
Skill page: https://clawhub.ai/qipengguo/flow-test
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install flow-test

ClawHub CLI


npx clawhub@latest install flow-test
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description match the SKILL.md: the skill designs semantic flow tests and asks for evidence and rubrics. It requires no unexpected environment variables, binaries, or installs unrelated to test design.
Instruction Scope
The instructions are scoped to designing tests: splitting deterministic vs semantic checks, specifying evidence to collect, and defining rubrics. They do ask the executor to capture evidence (URLs, page titles, extracted items), which is appropriate for the stated purpose. Note: that evidence could include sensitive or private content if the agent runs against authenticated/private targets — the skill itself does not instruct reading local files or credentials.
Install Mechanism
No install spec and no code files — instruction-only, so nothing is written to disk or downloaded by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. There is no disproportionate request for secrets or external access in the manifest or instructions.
Persistence & Privilege
The `always` flag is false and the skill is user-invocable; its instructions do not request permanent presence or modifications to other skills or global agent settings.
Assessment
This skill appears coherent and safe as an instruction-only test designer, but keep in mind: the tests it designs will capture and log evidence (URLs, page content, extracted items), which can include sensitive or private data if run against authenticated services or user content. Only run these tests against public or authorized targets, avoid supplying unrelated credentials, and review collected evidence before storing or sharing it. If you plan to have the agent run tests autonomously against production systems, add safeguards (rate limits, access controls, and human review for 'needs_review' cases).

Like a lobster shell, security has layers — review code before you run it.

Latest version: vk97b82phpm2hgef3vzhnc54e1s83hddz
134 downloads · 1 star · 1 version · updated 1 month ago
v0.0.1 · MIT-0 license

Flow Test

Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.

This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.

Invoke this skill when:

  • the task depends on live or changing web content
  • the output can vary but still be correct
  • the workflow spans multiple model or tool steps
  • intermediate evidence matters more than one exact final string
  • you need to verify user intent was satisfied, not exact wording

Do not use this skill when:

  • the result is deterministic and easy to assert directly
  • a schema check, exact match, snapshot, or pure function test is enough
  • the requirement can be covered fully by normal unit or integration tests

Objective

Turn a fuzzy requirement into a test design that combines:

  • deterministic checks for stable invariants
  • evidence collection for dynamic execution
  • semantic evaluation for variable outcomes
  • a bounded verdict of pass, fail, or needs_review
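
To make the combination concrete, one way to picture a flow test is as a small data structure holding all four pieces; the names `Verdict` and `FlowTestDesign` below are illustrative sketches, not part of the skill.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs_review"

@dataclass
class FlowTestDesign:
    # Exact checks over stable invariants; each takes the run result.
    deterministic_checks: list[Callable[[dict], bool]] = field(default_factory=list)
    # Names of the evidence fields the run must capture.
    evidence_fields: list[str] = field(default_factory=list)
    # The explicit rubric the semantic evaluator will apply.
    rubric: str = ""
```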

Design Principles

1. Keep asserts where they still work

Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:

  • tool call success
  • required fields
  • minimum counts
  • status codes
  • domain restrictions
  • date or freshness constraints when machine-checkable
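
As a sketch, these stable invariants map directly onto ordinary asserts; the `result` shape here is a hypothetical run record, not anything the skill prescribes.

```python
def check_deterministic(result: dict) -> None:
    """Exact checks for the stable parts of a run (hypothetical result shape)."""
    assert result["tool_error"] is None                    # tool call success
    assert result["status_code"] == 200                    # status codes
    for key in ("source_url", "final_answer"):             # required fields
        assert key in result
    assert len(result["items"]) >= 3                       # minimum counts
    assert result["source_url"].startswith("https://example.com/")  # domain restriction
```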

2. Judge task completion, not exact phrasing

Prefer questions like:

  • did the agent reach the right source
  • did it gather relevant information
  • does the final answer satisfy the user request

Avoid requiring one exact string unless the wording itself is the requirement.

3. Require inspectable evidence

Ask the execution flow to print or capture concise evidence such as:

  • visited URL
  • page title
  • visible headings
  • extracted entities
  • timestamps or date clues
  • key tool outputs
  • final answer

The evaluator should be able to inspect why a verdict was reached.
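
A minimal pattern, assuming a plain dict, is to accumulate evidence as the flow runs and emit the whole record at the end so the evaluator can trace the verdict; all values below are illustrative.

```python
evidence: dict = {"task": "find today's top headline"}

# Record each step's key facts as the flow executes.
evidence["source_url"] = "https://news.example.com"          # visited URL
evidence["source_title"] = "Example News - Home"             # page title
evidence["extracted_items"] = ["Headline A", "Headline B"]   # extracted entities
evidence["freshness_signals"] = ["published 2 hours ago"]    # date clues
evidence["final_answer"] = "Today's top story is Headline A."

print(evidence)  # the evaluator can now inspect why a verdict was reached
```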

4. Use explicit semantic rubrics

Never rely on vague instructions such as "judge whether it looks good."

Always define:

  • what evidence is required
  • what counts as a pass
  • what clearly fails
  • when uncertainty should become needs_review
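
In practice that means writing the rubric down verbatim for the evaluator. The wording below is an assumption, loosely adapted from the news example later in this document, not a rubric the skill ships with.

```python
RUBRIC = """\
Required evidence: source_url, source_title, extracted_items,
freshness_signals, final_answer.

Pass when: the source is a relevant news site, at least one extracted
item supports the final answer, and freshness signals indicate
same-day content.

Fail when: the source is irrelevant, the answer contradicts the
extracted items, or the content is clearly stale.

Needs review when: freshness signals are missing or ambiguous,
or any required evidence field is absent.
"""
```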

5. Prefer bounded confidence

If evidence is incomplete, contradictory, or too weak, do not force a pass.

Return needs_review.
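
A sketch of that rule as a guard: downgrade to needs_review whenever required evidence is missing, instead of letting the evaluator guess. The helper name is hypothetical.

```python
def bounded_verdict(evidence: dict, required: list[str], judged: str) -> str:
    """Never force a pass on weak evidence (illustrative helper)."""
    missing = [f for f in required if not evidence.get(f)]
    if missing:
        return "needs_review"  # incomplete evidence can never pass
    return judged              # otherwise defer to the semantic judgment
```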

Workflow

When invoked, design the test in the following order.

1. Identify why exact assertions are brittle

Classify the task:

  • dynamic web browsing
  • search or retrieval
  • LLM generation
  • multi-tool orchestration
  • end-to-end user flow

Then explain why literal equality or fixed snapshots are not sufficient.

2. Split deterministic checks from semantic checks

Write two groups:

Deterministic Checks

Use exact validation for stable parts, such as:

  • tool returned successfully
  • required fields are present
  • minimum number of results exists
  • source domain matches expectation
  • response includes a valid date range

Semantic Checks

Use agent evaluation for variable parts, such as:

  • relevance to the requested topic
  • freshness of the retrieved content
  • whether the answer reflects the gathered evidence
  • whether the workflow actually satisfies the intended task
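
The semantic group ultimately reduces to one call into an evaluator model. A sketch, where `ask_judge` is a hypothetical stand-in for whatever LLM client you actually use:

```python
import json

def ask_judge(prompt: str) -> str:
    """Hypothetical LLM call; substitute your real evaluator client."""
    raise NotImplementedError

def semantic_check(evidence: dict, rubric: str) -> dict:
    prompt = (
        "Apply this rubric to the evidence. Reply with JSON: "
        '{"verdict": "pass|fail|needs_review", "reason": "..."}\n\n'
        f"Rubric:\n{rubric}\n\nEvidence:\n{json.dumps(evidence, indent=2)}"
    )
    return json.loads(ask_judge(prompt))
```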

3. Define the evidence schema

Specify exactly what the run should log or output.

Recommended evidence fields:

  • task
  • source_url
  • source_title
  • extracted_items
  • freshness_signals
  • intermediate_results
  • final_answer
  • evaluator_notes

Keep evidence minimal but sufficient for review.
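
The recommended fields translate directly into a schema; a TypedDict keeps it minimal (a sketch using the field names above, with plausible but assumed types).

```python
from typing import TypedDict

class Evidence(TypedDict, total=False):
    task: str
    source_url: str
    source_title: str
    extracted_items: list[str]
    freshness_signals: list[str]
    intermediate_results: list[str]
    final_answer: str
    evaluator_notes: str
```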

4. Define the verdict rubric

Use this baseline:

Pass

  • the agent reached a relevant source or completed the intended flow
  • collected evidence supports the conclusion
  • the final output is relevant and sufficiently current for the task
  • there is no major contradiction between evidence and answer

Fail

  • the agent failed to reach a relevant source or complete the flow
  • the result is clearly irrelevant, stale, or fabricated
  • the output contradicts the evidence
  • the workflow misses a required user objective

Needs Review

  • evidence is partial or ambiguous
  • freshness cannot be determined confidently
  • multiple interpretations remain plausible

5. Produce a structured test spec

Return the design in this format:

## Test Intent

## Why Exact Assert Fails

## Deterministic Checks

## Evidence To Collect

## Semantic Rubric

## Execution Notes

## Final Verdict Format

Output Template

## Test Intent
- Validate that:

## Why Exact Assert Fails
- Dynamic factors:
- Why literal equality is brittle:

## Deterministic Checks
- Check 1:
- Check 2:

## Evidence To Collect
- Evidence 1:
- Evidence 2:

## Semantic Rubric
- Pass when:
- Fail when:
- Needs review when:

## Execution Notes
- Constraints:
- Allowed variance:
- Safety concerns:

## Final Verdict Format
- verdict: pass | fail | needs_review
- reason:
- evidence:
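
Rendered as data, a verdict in this format might look like the following; every value is illustrative.

```python
verdict = {
    "verdict": "needs_review",
    "reason": "Headlines were relevant, but no publish dates were visible.",
    "evidence": {
        "source_url": "https://news.example.com",
        "extracted_items": ["Headline A", "Headline B"],
    },
}
```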

Example

Task: verify that visiting a news site returns today's news rather than stale content.

Good test design:

  • deterministic checks confirm the page loads and at least one article item is collected
  • evidence includes the visited site, page title, visible headlines, date clues, and final summary
  • semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
  • semantic rubric fails when headlines are outdated, unrelated, or invented
  • semantic rubric returns needs_review when freshness cannot be established from the evidence

Bad test design:

  • assert returned_text == "Today's news is ..."
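
For contrast, here is a compact sketch of the good design as a pytest test, reusing the `RUBRIC`, `semantic_check`, and `bounded_verdict` sketches from earlier sections; `browse` is a hypothetical helper, not a real API.

```python
import pytest

def browse(url: str) -> dict:
    """Hypothetical browsing helper returning page facts for the test."""
    raise NotImplementedError

def test_todays_news_flow():
    result = browse("https://news.example.com")

    # Deterministic checks: the page loads and at least one article is collected.
    assert result["status_code"] == 200
    assert len(result["headlines"]) >= 1

    # Evidence for the evaluator.
    evidence = {
        "task": "fetch today's news",
        "source_url": result["url"],
        "source_title": result["title"],
        "extracted_items": result["headlines"],
        "freshness_signals": result["date_clues"],
        "final_answer": result["summary"],
    }

    # Semantic verdict, bounded by evidence completeness.
    judged = semantic_check(evidence, RUBRIC)["verdict"]
    verdict = bounded_verdict(evidence, list(evidence), judged)
    if verdict == "needs_review":
        pytest.skip("evidence inconclusive; route to human review")
    assert verdict == "pass"
```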

Guidance

When using this skill:

  • keep traditional asserts for stable invariants
  • use semantic evaluation only where exact matching becomes brittle
  • prefer narrow rubrics over subjective judgment
  • require visible evidence before passing the test
  • state uncertainty explicitly instead of masking it

Deliverables

When asked to design a flow test, provide:

  • a structured test spec
  • deterministic checks
  • an evidence schema
  • a semantic rubric
  • a final verdict format
