autoresearch

# autoresearch Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. ## Triggers Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. ## Description Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects. ## How It Works ``` ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ 1. BASELINE │────▶│ 2. MUTATE │────▶│ 3. EVALUATE │────▶│ 4. DECIDE │ │ Score the │ │ Change one │ │ Run scoring │ │ Better? │ │ current │ │ thing │ │ function │ │ Keep : Revert│ │ version │ │ │ │ │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ └──────┬───────┘ │ Loop back to 2 ``` ## Instructions ### Step 1: Identify the Mutable File The **mutable file** is the thing you're optimizing. It can be: - A SKILL.md prompt/instructions - A trading strategy config (thresholds, parameters) - A content template (YouTube script format, ad copy structure) - Any text file where changes produce measurable differences Create or identify this file. Example: ``` my-skill/ ├── SKILL.md ← this is your mutable file ├── eval/ │ ├── test_cases.json │ └── score.py ``` ### Step 2: Create an Evaluation Function Your eval function must: 1. **Take the current mutable file as input** 2. **Run it against test cases** 3. **Return a numeric score** (higher = better) The eval can be anything: - **LLM-as-judge**: Send output to an LLM, ask it to score 1-100 - **Backtest**: Run a strategy against historical data, measure Sharpe/returns - **A/B metrics**: CTR, engagement, conversion rate - **Binary pass/fail**: Count how many test cases pass out of N Template eval function (customize for your domain): ```python # eval/score.py import json import sys def evaluate(mutable_file_path: str, test_cases_path: str) -> float: """ Score the current version of the mutable file. Returns a float — higher is better. """ with open(mutable_file_path) as f: current_version = f.read() with open(test_cases_path) as f: test_cases = json.load(f) scores = [] for case in test_cases: # YOUR SCORING LOGIC HERE # Example: run the prompt, compare output to expected score = run_and_score(current_version, case) scores.append(score) return sum(scores) / len(scores) if __name__ == "__main__": score = evaluate(sys.argv[1], sys.argv[2]) print(f"SCORE: {score}") ``` ### Step 3: Run the Autoresearch Loop The loop follows this exact pattern: ``` 1. Git init (if not already) — every experiment is a commit 2. Run eval on current version → get BASELINE score 3. For each experiment (1..N): a. Read the current mutable file b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule) c. Write the mutated version d. Run eval → get NEW score e. If NEW > BASELINE: - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}" - Update BASELINE = NEW - Log: "✅ KEPT — improvement" f. If NEW <= BASELINE: - Git checkout the mutable file (revert) - Log: "❌ REVERTED — no improvement" 4. Print final summary: experiments run, improvements found, final score ``` #### Agent Instructions for Running the Loop When the user says "run autoresearch on X", follow this procedure: 1. **Locate the mutable file** — ask the user or infer from context 2. **Locate or create the eval function** — the user must have a way to score 3. **Initialize git tracking** in the project directory 4. **Run baseline eval** — record the starting score 5. **Begin experiment loop:** - Read the mutable file - Think about what single change might improve the score - Make the change (be specific — change ONE thing per experiment) - Run eval - Keep or revert based on score - Log the result 6. **Continue for N experiments** (default: 20, or until user stops) 7. **Report results:** - Starting score → Final score - Number of experiments run - Number of improvements kept - Summary of what changes worked #### Mutation Strategy Good mutations change ONE thing at a time: - **Numeric parameters**: Adjust thresholds, weights, window sizes - **Prompt wording**: Rephrase instructions, add/remove constraints - **Structure**: Reorder sections, add examples, remove redundancy - **Rules**: Add a new rule, tighten an existing one, relax a constraint Bad mutations change everything at once — you can't learn what worked. ### Step 4: Git Tracking Every experiment MUST be tracked in git: ```bash # Before starting git init git add -A git commit -m "baseline: score {X}" # After each successful mutation git add -A git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}" # After each failed mutation git checkout -- {mutable_file} ``` This gives you: - Full history of every experiment - Ability to diff any two versions - Easy rollback if something breaks - A log of what mutations worked vs didn't ## Proven Results ### Case Study 1: Gold Trading Strategy - **Task**: Optimize XAUUSD trading parameters - **Mutable file**: Strategy config (EMA periods, momentum threshold, position sizing) - **Eval function**: Backtest on historical data → Sharpe ratio - **Baseline**: Sharpe 5.80 - **Experiments**: 86 in 25 minutes - **Final**: Sharpe 12.23 (+111%) - **Key discoveries**: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization - See: `references/gold-results.md` ### Case Study 2: YouTube Shorts Scripts - **Task**: Optimize script-writing prompt for higher quality scores - **Mutable file**: SKILL.md prompt instructions - **Eval function**: LLM judge scoring 1-100 - **Baseline**: 94.3/100 - **Experiments**: 11 - **Final**: 96.7/100 (+2.5%) - **Key discoveries**: Atomic sentences, strict 40-50 word range, stronger negative examples - See: `references/youtube-results.md` ## Example Usage **User**: "Run autoresearch on my email subject line skill" **Agent workflow**: 1. Read the skill's SKILL.md (mutable file) 2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction) 3. Baseline: 72.4/100 4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT 5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED 6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT 7. ... continue for 20 experiments 8. Final: 79.2/100 (+9.4%) **User**: "Optimize my trading strategy config" **Agent workflow**: 1. Read strategy.json (mutable file) 2. Eval: run backtest script → Sharpe ratio 3. Baseline: Sharpe 2.1 4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅ 5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌ 6. ... continue 7. Final: Sharpe 3.8 (+81%)

autoresearch

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

autoresearch

autoresearch

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement