flow-test

# Flow Test Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone. This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric. Invoke this skill when: - the task depends on live or changing web content - the output can vary but still be correct - the workflow spans multiple model or tool steps - intermediate evidence matters more than one exact final string - you need to verify user intent was satisfied, not exact wording Do not use this skill when: - the result is deterministic and easy to assert directly - a schema check, exact match, snapshot, or pure function test is enough - the requirement can be covered fully by normal unit or integration tests # Objective Turn a fuzzy requirement into a test design that combines: - deterministic checks for stable invariants - evidence collection for dynamic execution - semantic evaluation for variable outcomes - a bounded verdict of `pass`, `fail`, or `needs_review` # Design Principles ## 1. Keep asserts where they still work Do not replace traditional tests blindly. Preserve exact checks for stable facts such as: - tool call success - required fields - minimum counts - status codes - domain restrictions - date or freshness constraints when machine-checkable ## 2. Judge task completion, not exact phrasing Prefer questions like: - did the agent reach the right source - did it gather relevant information - does the final answer satisfy the user request Avoid requiring one exact string unless the wording itself is the requirement. ## 3. Require inspectable evidence Ask the execution flow to print or capture concise evidence such as: - visited URL - page title - visible headings - extracted entities - timestamps or date clues - key tool outputs - final answer The evaluator should be able to inspect why a verdict was reached. ## 4. Use explicit semantic rubrics Never rely on vague instructions such as "judge whether it looks good." Always define: - what evidence is required - what counts as a pass - what clearly fails - when uncertainty should become `needs_review` ## 5. Prefer bounded confidence If evidence is incomplete, contradictory, or too weak, do not force a pass. Return `needs_review`. # Workflow When invoked, design the test in the following order. ## 1. Identify why exact assertions are brittle Classify the task: - dynamic web browsing - search or retrieval - LLM generation - multi-tool orchestration - end-to-end user flow Then explain why literal equality or fixed snapshots are not sufficient. ## 2. Split deterministic checks from semantic checks Write two groups: ### Deterministic Checks Use exact validation for stable parts, such as: - tool returned successfully - required fields are present - minimum number of results exists - source domain matches expectation - response includes a valid date range ### Semantic Checks Use agent evaluation for variable parts, such as: - relevance to the requested topic - freshness of the retrieved content - whether the answer reflects the gathered evidence - whether the workflow actually satisfies the intended task ## 3. Define the evidence schema Specify exactly what the run should log or output. Recommended evidence fields: - task - source_url - source_title - extracted_items - freshness_signals - intermediate_results - final_answer - evaluator_notes Keep evidence minimal but sufficient for review. ## 4. Define the verdict rubric Use this baseline: ### Pass - the agent reached a relevant source or completed the intended flow - collected evidence supports the conclusion - the final output is relevant and sufficiently current for the task - there is no major contradiction between evidence and answer ### Fail - the agent failed to reach a relevant source or complete the flow - the result is clearly irrelevant, stale, or fabricated - the output contradicts the evidence - the workflow misses a required user objective ### Needs Review - evidence is partial or ambiguous - freshness cannot be determined confidently - multiple interpretations remain plausible ## 5. Produce a structured test spec Return the design in this format: ```markdown ## Test Intent ## Why Exact Assert Fails ## Deterministic Checks ## Evidence To Collect ## Semantic Rubric ## Execution Notes ## Final Verdict Format ``` # Output Template ```markdown ## Test Intent - Validate that: ## Why Exact Assert Fails - Dynamic factors: - Why literal equality is brittle: ## Deterministic Checks - Check 1: - Check 2: ## Evidence To Collect - Evidence 1: - Evidence 2: ## Semantic Rubric - Pass when: - Fail when: - Needs review when: ## Execution Notes - Constraints: - Allowed variance: - Safety concerns: ## Final Verdict Format - verdict: pass | fail | needs_review - reason: - evidence: ``` # Example Task: verify that visiting a news site returns today's news rather than stale content. Good test design: - deterministic checks confirm the page loads and at least one article item is collected - evidence includes the visited site, page title, visible headlines, date clues, and final summary - semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source - semantic rubric fails when headlines are outdated, unrelated, or invented - semantic rubric returns `needs_review` when freshness cannot be established from the evidence Bad test design: - `assert returned_text == "Today's news is ..."` # Guidance When using this skill: - keep traditional asserts for stable invariants - use semantic evaluation only where exact matching becomes brittle - prefer narrow rubrics over subjective judgment - require visible evidence before passing the test - state uncertainty explicitly instead of masking it # Deliverables When asked to design a flow test, provide: - a structured test spec - deterministic checks - an evidence schema - a semantic rubric - a final verdict format

flow-test

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

flow-test

flow-test

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement