multi-model-critique
# Multi-Model Critique
## Overview
Use this skill only for complex tasks. Route multiple models through the same 4-step loop (`Plan -> Execute -> Review -> Improve`), then run cross-critique and synthesis to produce a higher-quality final answer than any single-model draft.
## Trigger rule
Enable this skill only when the request explicitly sets `complex` to true (or equivalent wording such as “this is complex/deep”).
If `complex` is false, skip this skill and respond with normal single-model behavior.
## Inputs
Collect or confirm these inputs before execution:
- `complex`: boolean flag (must be true)
- `question`: user request
- `models`: list of ACP `agentId` values (typically 3)
- `constraints`: output format, language, length, deadlines, forbidden assumptions
- `ops`: optional runtime controls (`timeoutSec`, `maxRetries`, `maxRounds`, `budgetUsd`)
## File map (what each file does)
- `SKILL.md` (this file): orchestration policy, trigger conditions, and execution sequence.
- `references/prompt-templates.md`: reusable prompts for draft, critique, revision, and final synthesis (includes scoring rubric usage).
- `references/orchestration-template.md`: practical OpenClaw orchestration flow using `sessions_spawn`, `sessions_send`, and `sessions_history`.
- `references/output-schema.md`: machine-parseable JSON output schema for final result and per-model scoring.
- `scripts/build_round_prompts.py`: utility to generate per-model prompt files for repeated runs.
- `scripts/run_orchestration.py`: local helper that builds a run plan JSON (model mapping, round prompts, runtime settings).
## Workflow
### Step 1) Parallel draft round
Spawn one ACP session per model with the same task and constraints.
Per-model requirements:
- Follow the exact internal sequence: `Plan -> Execute -> Review -> Improve`
- Print all four sections explicitly
- End with `Draft Answer`
Use `sessions_spawn` with `runtime:"acp"` and explicit `agentId`.
### Step 2) Cross-critique round
Share peer `Draft Answer` outputs with each model and require structured critique:
- Strengths
- Weaknesses
- Missing assumptions/data
- Hallucination and confidence risks
- Concrete fix suggestions
Also require ranking of peer drafts with rationale.
### Step 3) Revision round
Send critique feedback back to each original model and request revision:
- Keep `Plan -> Execute -> Review -> Improve`
- Include `Changes from Critique`
- End with `Revised Answer`
### Step 4) Final synthesis round
Integrate revised answers into one user-facing output:
- Best final answer
- Why the synthesis is stronger than individual drafts
- Remaining uncertainties
- Optional next actions
## Scoring rubric (required in critique + synthesis)
Score each draft on a 1-5 scale:
- `accuracy`: factual correctness and internal consistency
- `coverage`: completeness against user request and constraints
- `evidence`: quality of assumptions and support
- `actionability`: usefulness for concrete decision/action
Default weighted score:
`0.40 * accuracy + 0.25 * coverage + 0.20 * evidence + 0.15 * actionability`
Use this score to justify rankings and the final selected direction.
## Prompting resources
- Use `references/prompt-templates.md` for canonical prompts.
- Use `scripts/build_round_prompts.py` when you need file-based prompt generation for repeated or batched runs.
- Use `scripts/run_orchestration.py` to generate a deterministic run-plan artifact for reproducible execution.
- Use `references/orchestration-template.md` for concrete OpenClaw tool-call flow.
## Required user-facing output shape
1. `Final Answer`
2. `Key Improvements from Critique`
3. `Uncertainties`
4. `Next Steps` (optional)
When machine consumption is needed, return JSON matching `references/output-schema.md`.
Do not expose private chain-of-thought. Provide concise reasoning summaries only.
## Failure handling
- One model fails: continue with remaining models and note reduced diversity.
- Two or more models fail: ask whether to retry or switch to single-model mode.
- Strong disagreement remains: present competing hypotheses and state what evidence would resolve them.
## Runtime defaults (recommended)
- `timeoutSec`: 180 per round per model
- `maxRetries`: 1 per failed model turn
- `maxRounds`: fixed at 4 (draft, critique, revision, synthesis)
- `budgetUsd`: optional hard stop when cost-sensitive
标签
skill
ai