llm-eval-router

**Last used:** 2026-03-24 **Memory references:** 7 **Status:** Active # llm-eval-router Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality — reducing inference costs with evidence, not hope. ## The core idea Run every task through your best local model (shadow) in parallel with your cloud baseline (ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model hits 0.95 mean score, promote it to handle that task type in production. Demote it automatically if quality drops. ## When to use - You're paying for Claude/GPT API calls on tasks that don't need that quality - You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.) - You want evidence-based cost reduction, not blind routing - You have defined task types: summarize, classify, extract, format, analyze, RAG ## When NOT to use - Tasks that require real-time web knowledge (use cloud) - Tasks with strict latency requirements < 2 seconds (local models on CPU are slow) - Tasks with high safety stakes (always use cloud with safety filters) - You don't have Ollama or a Mac/Linux machine with enough RAM (8GB+ per model) ## Prerequisites - Ollama installed and running (ollama.com) - At least one capable model: `ollama pull qwen2.5` or `ollama pull phi4` - Python 3.10+ - API keys: Anthropic (ground truth) + OpenAI (judge) — Gemini optional (tiebreaker) - Langfuse for observability (self-hosted or cloud) — optional but strongly recommended ## Network & Privacy This skill makes outbound API calls to: - **Anthropic API** — to generate ground truth baseline responses (every accumulation cycle) - **OpenAI API** — for judge scoring (sampled at 15% of runs) - **Google Gemini API** — tiebreaker judge only (when primary judges disagree by ≥0.20) **What stays local:** - All Ollama model inference runs entirely on your device - Scored run data is stored on disk in `data/scores/*.json` - No telemetry, analytics, or data collection of any kind - No data is sent anywhere other than the explicit API calls above **Langfuse** (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network. ## Core concepts ### 6-Dimension Evaluation Every response is scored on: | Dimension | Default weight | Analyze weight | What it measures | |---|---|---|---| | Structural | 25% | **10%** | Format compliance, required keys present | | Semantic | 25% | **40%** | Meaning equivalence to ground truth | | Factual | 20% | 25% | No hallucinated facts/numbers/entities | | Completion | 15% | 18% | Task fully addressed | | Tool use | 10% | 4% | Correct tool/format selection | | Latency | 5% | 3% | Within acceptable bounds | **Important:** Use per-task weight overrides. The default 25/25 split treats structural accuracy equally with semantic similarity — which works for extract/classify/format tasks (where exact format matters) but is wrong for open-ended analysis. `difflib.SequenceMatcher` on two prose analyses of the same question scores ~0.29 even when they're semantically identical. With structural weight at 25%, this alone caps analyze scores at ~0.59. ```python # src/evaluator.py — per-task weight profiles TASK_WEIGHT_OVERRIDES = { "analyze": { "structural_accuracy": 0.10, # difflib is NOT meaningful for prose "semantic_similarity": 0.40, # cosine over embeddings captures meaning "factual_drift": 0.25, "task_completion": 0.18, "tool_correctness": 0.04, "latency_score": 0.03, }, "code_transform": { "structural_accuracy": 0.15, "semantic_similarity": 0.35, "factual_drift": 0.20, "task_completion": 0.20, "tool_correctness": 0.07, "latency_score": 0.03, }, } ``` **Also:** For analyze tasks, constrain output structure via system_prompt so GT and candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning). This reduces Layer 2 drift and improves difflib scores even at reduced weight. ### Judge ensemble - **Primary judges** (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently - **Tiebreaker** (only when |score_A - score_B| ≥ 0.20): Gemini 2.5-flash - **Unsampled runs** (85%): Layer 1+2 validators only (deterministic, free) - **Promotion gates** always trigger full judge evaluation regardless of sampling rate ### Layer 1+2 validators (free, deterministic) - **Layer 1**: JSON validity, required key presence, forbidden pattern check - **Layer 2**: Drift detection — novel entities/numbers/URLs not in ground truth These run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers. ### Promotion / Demotion - **Promote**: 200+ runs, rolling mean ≥ 0.95 for a model/task pair - **Demote**: rolling 7-day pass rate < 0.92 - **Control floor**: one model (phi4, granite4, or similar) serves as the measured floor — any model scoring below it should be flagged, not promoted ## Implementation steps ### Step 1 — Define your task types Create `config/task_types.yaml`: ```yaml tasks: - id: summarize description: "Summarize a document in N sentences" require_json: false judge_dimensions: [semantic, factual, completion] - id: classify description: "Classify text into one of N categories" require_json: true # response must be valid JSON judge_dimensions: [structural, semantic, completion] - id: extract description: "Extract structured data from unstructured text" require_json: true judge_dimensions: [structural, factual, completion] - id: format description: "Reformat content to match a template" require_json: false judge_dimensions: [structural, semantic, completion] ``` ### Step 2 — Set up the router The router assigns each task to a model using a round-robin strategy during burn-in (building n), then switches to confidence-weighted routing after promotion. ```python # src/router.py — simplified version class Router: def __init__(self, candidates: list[str], control_floor: str): self.candidates = candidates self.control_floor = control_floor self._rr_counters = defaultdict(int) def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str: """Return the best model for this task type.""" promoted = confidence_tracker.get_promoted(task_type) if promoted: return promoted # use promoted model directly # Round-robin during burn-in for fair exposure idx = self._rr_counters[task_type] % len(self.candidates) self._rr_counters[task_type] += 1 return self.candidates[idx] ``` ### Step 3 — Ground truth comparison For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production — it's only for evaluation. ```python async def evaluate_pair(prompt: str, local_response: str, gt_response: str, task_type: str) -> float: # Layer 1: deterministic l1_score = validators.layer1(local_response, task_type) if l1_score == 0.0: return 0.0 # hard fail — safety or format violation # Layer 2: heuristic drift l2_score = validators.layer2(local_response, gt_response) # Sample judges (15%) if random.random() < JUDGE_SAMPLE_RATE: sonnet_score = await judge_sonnet(prompt, local_response, gt_response) mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response) if abs(sonnet_score - mini_score) >= 0.20: gemini_score = await judge_gemini(prompt, local_response, gt_response) final = median([sonnet_score, mini_score, gemini_score]) else: final = (sonnet_score + mini_score) / 2 return weighted_score(l1_score, l2_score, final) else: return weighted_score(l1_score, l2_score, judge_score=None) ``` ### Step 4 — Confidence tracker Track scores per model/task pair on disk (so restarts don't lose data): ```python # src/scoring/confidence.py — simplified @dataclass class ModelStats: model_id: str task_type: str scores: list[float] # all scores (None excluded) promoted: bool = False demoted: bool = False @property def mean(self) -> float: return sum(self.scores) / len(self.scores) if self.scores else 0.0 @property def n(self) -> int: return len(self.scores) def should_promote(self) -> bool: return self.n >= 200 and self.mean >= 0.95 and not self.promoted def should_demote(self) -> bool: recent = self.scores[-50:] # last 50 pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent) return pass_rate < 0.92 and not self.demoted ``` ### Step 5 — Accumulator loop Run this on a cron (every 10-20 minutes via launchd/systemd): ```python # run_accumulate.py async def accumulate(): task_type = pick_next_task() # round-robin across task types prompt, gt_response = generate_task(task_type) # call cloud baseline for candidate in router.get_candidates(task_type): local_response = await ollama_client.complete(candidate, prompt) score = await evaluate_pair(prompt, local_response, gt_response, task_type) confidence_tracker.record(candidate, task_type, score) if confidence_tracker.should_promote(candidate, task_type): router.promote(candidate, task_type) langfuse.log_promotion(candidate, task_type, confidence_tracker.stats(candidate, task_type)) ``` ### Step 6 — Routing policy ```yaml # config/routing_policy.yaml control_floor_model: phi4:latest # never promote below this model's score task_policies: policy_check_high_risk: never_local: true # these tasks always use cloud model summarize: min_score_for_routing: 0.85 fallback_chain: [qwen2.5, llama3.1, phi4] classify: min_score_for_routing: 0.90 # higher bar for classification fallback_chain: [qwen2.5, granite4, llama3.1] ``` ### Step 7 — API Expose a simple HTTP API (FastAPI): ``` POST /run — route a task through the best available model GET /health — service status + promoted models + ollama connectivity GET /status — full scoreboard (model × task × mean × n) GET /report — cost heatmap + efficiency analysis ``` ## Key lessons learned (from 900+ production runs) **What worked:** - phi4 as control floor: a measured floor model prevents "promoted because everyone else is also bad" errors. If the floor model beats a candidate, flag it — don't promote. - Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning) must have `<think>...</think>` blocks stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content. - `None ≠ 0.0` for unsampled runs: a run where no judge scored is not a failing run. Store `None`, exclude from mean. Mixing None with 0.0 poisons the mean. - `require_json: False` for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON." - **Per-task weight overrides**: do not use one weight profile for all task types. Structural accuracy (difflib) is wrong for prose analysis — use semantic similarity as the primary signal for open-ended tasks. This lifted analyze mean from 0.44–0.59 to 0.70. - **Structured output prompts for analyze tasks**: add a `system_prompt` that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and candidates follow the same template, improving structural alignment and reducing drift penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses. - **MCP server for agentic access**: expose CP as MCP tools (`run_task`, `get_status`, `get_champions`, `get_promotion_timeline`, `get_cost_heatmap`). Lets an LLM agent query evaluation state without bespoke integration work. **What didn't work:** - Large models (>9GB): gpt-oss:20b and similar required 39+ second inference — the latency dimension alone tanks the composite score. Practical ceiling is ~9GB models on 24GB unified memory to avoid GPU memory swapping. - 100% judge sampling: runs through the full Claude+GPT+Gemini panel on every evaluation costs more in judge API fees than you save by routing locally. Sample at 15%. - Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use qdrant or numpy cosine store instead. - **One-size-fits-all weight profiles**: defining global weights at system init and never overriding per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early — if a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models. ## Expected timeline With a 20-minute accumulator cadence and 9 candidates × 7 task types: - First 50 runs per model: ~5 hours - First promotions (200 runs): ~1-2 days per model/task pair - Stable routing layer: 1-2 weeks ## Cost estimate Per accumulation cycle (one task, one model): - Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens) - Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini) - Local model: $0 (Ollama, on-device) At 6 runs/hour × 24 hours: ~$0.70/day during burn-in. After first promotions: drops to ~$0.10/day (90%+ of task volume local).

llm-eval-router

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

llm-eval-router

llm-eval-router

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement