返回顶部
f

flow-test

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters more than literal equality.

作者: admin | 来源: ClawHub
源自
ClawHub
版本
V 0.0.1
安全检测
已通过
91
下载量
1
收藏
概述
安装方式
版本历史

flow-test

# Flow Test Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone. This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric. Invoke this skill when: - the task depends on live or changing web content - the output can vary but still be correct - the workflow spans multiple model or tool steps - intermediate evidence matters more than one exact final string - you need to verify user intent was satisfied, not exact wording Do not use this skill when: - the result is deterministic and easy to assert directly - a schema check, exact match, snapshot, or pure function test is enough - the requirement can be covered fully by normal unit or integration tests # Objective Turn a fuzzy requirement into a test design that combines: - deterministic checks for stable invariants - evidence collection for dynamic execution - semantic evaluation for variable outcomes - a bounded verdict of `pass`, `fail`, or `needs_review` # Design Principles ## 1. Keep asserts where they still work Do not replace traditional tests blindly. Preserve exact checks for stable facts such as: - tool call success - required fields - minimum counts - status codes - domain restrictions - date or freshness constraints when machine-checkable ## 2. Judge task completion, not exact phrasing Prefer questions like: - did the agent reach the right source - did it gather relevant information - does the final answer satisfy the user request Avoid requiring one exact string unless the wording itself is the requirement. ## 3. Require inspectable evidence Ask the execution flow to print or capture concise evidence such as: - visited URL - page title - visible headings - extracted entities - timestamps or date clues - key tool outputs - final answer The evaluator should be able to inspect why a verdict was reached. ## 4. Use explicit semantic rubrics Never rely on vague instructions such as "judge whether it looks good." Always define: - what evidence is required - what counts as a pass - what clearly fails - when uncertainty should become `needs_review` ## 5. Prefer bounded confidence If evidence is incomplete, contradictory, or too weak, do not force a pass. Return `needs_review`. # Workflow When invoked, design the test in the following order. ## 1. Identify why exact assertions are brittle Classify the task: - dynamic web browsing - search or retrieval - LLM generation - multi-tool orchestration - end-to-end user flow Then explain why literal equality or fixed snapshots are not sufficient. ## 2. Split deterministic checks from semantic checks Write two groups: ### Deterministic Checks Use exact validation for stable parts, such as: - tool returned successfully - required fields are present - minimum number of results exists - source domain matches expectation - response includes a valid date range ### Semantic Checks Use agent evaluation for variable parts, such as: - relevance to the requested topic - freshness of the retrieved content - whether the answer reflects the gathered evidence - whether the workflow actually satisfies the intended task ## 3. Define the evidence schema Specify exactly what the run should log or output. Recommended evidence fields: - task - source_url - source_title - extracted_items - freshness_signals - intermediate_results - final_answer - evaluator_notes Keep evidence minimal but sufficient for review. ## 4. Define the verdict rubric Use this baseline: ### Pass - the agent reached a relevant source or completed the intended flow - collected evidence supports the conclusion - the final output is relevant and sufficiently current for the task - there is no major contradiction between evidence and answer ### Fail - the agent failed to reach a relevant source or complete the flow - the result is clearly irrelevant, stale, or fabricated - the output contradicts the evidence - the workflow misses a required user objective ### Needs Review - evidence is partial or ambiguous - freshness cannot be determined confidently - multiple interpretations remain plausible ## 5. Produce a structured test spec Return the design in this format: ```markdown ## Test Intent ## Why Exact Assert Fails ## Deterministic Checks ## Evidence To Collect ## Semantic Rubric ## Execution Notes ## Final Verdict Format ``` # Output Template ```markdown ## Test Intent - Validate that: ## Why Exact Assert Fails - Dynamic factors: - Why literal equality is brittle: ## Deterministic Checks - Check 1: - Check 2: ## Evidence To Collect - Evidence 1: - Evidence 2: ## Semantic Rubric - Pass when: - Fail when: - Needs review when: ## Execution Notes - Constraints: - Allowed variance: - Safety concerns: ## Final Verdict Format - verdict: pass | fail | needs_review - reason: - evidence: ``` # Example Task: verify that visiting a news site returns today's news rather than stale content. Good test design: - deterministic checks confirm the page loads and at least one article item is collected - evidence includes the visited site, page title, visible headlines, date clues, and final summary - semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source - semantic rubric fails when headlines are outdated, unrelated, or invented - semantic rubric returns `needs_review` when freshness cannot be established from the evidence Bad test design: - `assert returned_text == "Today's news is ..."` # Guidance When using this skill: - keep traditional asserts for stable invariants - use semantic evaluation only where exact matching becomes brittle - prefer narrow rubrics over subjective judgment - require visible evidence before passing the test - state uncertainty explicitly instead of masking it # Deliverables When asked to design a flow test, provide: - a structured test spec - deterministic checks - an evidence schema - a semantic rubric - a final verdict format

标签

skill ai

通过对话安装

该技能支持在以下平台通过对话安装:

OpenClaw WorkBuddy QClaw Kimi Claude

方式一:安装 SkillHub 和技能

帮我安装 SkillHub 和 flow-test-1776030249 技能

方式二:设置 SkillHub 为优先技能安装源

设置 SkillHub 为我的优先技能安装源,然后帮我安装 flow-test-1776030249 技能

通过命令行安装

skillhub install flow-test-1776030249

下载 Zip 包

⬇ 下载 flow-test v0.0.1

文件大小: 3.13 KB | 发布时间: 2026-4-13 10:19

v0.0.1 最新 2026-4-13 10:19
- Initial release of the "flow-test" skill for agent-evaluated testing of browser tasks, LLM outputs, and tool workflows.
- Provides a structured approach for designing tests where exact asserts are brittle and semantic success is more important than literal equality.
- Introduces clear guidelines for when to use flow tests versus traditional assertions.
- Defines an evidence-based evaluation process, including deterministic checks, semantic rubrics, and a bounded verdict system (`pass`, `fail`, `needs_review`).
- Includes output templates and deliverable requirements for consistent test design.

Archiver·手机版·闲社网·闲社论坛·羊毛社区· 多链控股集团有限公司 · 苏ICP备2025199260号-1

Powered by Discuz! X5.0   © 2024-2025 闲社网·线报更新论坛·羊毛分享社区·http://xianshe.com

p2p_official_large
返回顶部