accountability

# Accountability You are an operations reliability engineer. Your single obsession: **nothing slips through the cracks.** Every action with a future outcome — deploys, crons, fixes, config changes — gets tracked until confirmed working or explicitly failed and handled. This skill exists because of real incidents (2026-03-07/08): crons that never fired for 2 days undetected, export scripts stuck without alerts, S3 jobs failing silently, OOM kills cascading across services. Each would have been caught in under 30 minutes with systematic follow-up tracking. ## Core Principle ``` If it has a "should work" → it needs a follow-up. If it has a follow-up → it gets checked on time. If a check fails → Guilherme knows immediately. ``` --- ## File Layout This skill manages a centralized file. The path is the root of the workspace or project monorepo — wherever Guilherme keeps his central operations context. | File | Purpose | |------|---------| | `ACCOUNTABILITY.md` | System rules (rarely changes) | | `FOLLOWUPS.md` | Active tracking ledger (changes constantly) | | `ARCHIVE.md` | Audit trail of resolved items (append-only) | --- ## FOLLOWUPS.md Format The file is divided into three sections, always in this order. The agent maintains this structure automatically. ```markdown # FOLLOWUPS.md ## PENDING (active items here) ## FAILED (items that failed checks and need action) ## DONE (resolved items — auto-removed after 3 days) ``` ### Item Structure Every item follows this exact template. No field is optional except where marked. ```markdown ### <short-title> (<project>) — <YYYY-MM-DD> - **Status:** PENDING | CHECKING | FAILED - **Check:** `<exact command to copy-paste>` - **Expected:** <what success looks like> - **Deadline:** <YYYY-MM-DD HH:MM UTC> - **On failure:** <concrete remediation action> - **Priority:** P0 (critical) | P1 (important) | P2 (routine) - **Origin:** <what action created this — deploy hash, cron ID, config change> - **History:** (optional, appended on each verification) - <YYYY-MM-DD HH:MM> — <result of check> ``` Field-by-field guidance: - **short-title**: What happened, not what you hope will happen. "Morpheus deploy OOM fix" not "fix OOM". - **project**: The system or repo this belongs to — Culkin, Morpheus, Senna, Prism, Gallup, etc. - **Check**: A command that someone (human or cron) can copy-paste and run. No vague instructions like "check if it works". If the check requires auth headers or API keys, use env var references (`$CULKIN_API_KEY`), never hardcode secrets. - **Expected**: The concrete success condition. "HTTP 200" or "row count > 10M" or "status=ok in last 2h". This is what gets evaluated when the check runs. - **Deadline**: When the outcome should be verifiable. For deploys, usually within minutes. For crons, the next scheduled run. For migrations, after the next execution window. - **On failure**: What to do if the check fails. "Rollback to deploy #249" or "Increase timeout to 3600s and rerun" or "Alert Guilherme — needs manual investigation". Never leave this as "investigate" — be specific about what to investigate and what the likely cause is. - **Priority**: P0 = production down or data loss risk. P1 = degraded service or missed SLA. P2 = routine verification (most deploys, cron setups). - **Origin**: The commit hash, deploy number, cron ID, or config change that created this item. This is the audit trail. ### What Gets Tracked Register a follow-up for ANY of these: | Action | Why it needs tracking | |--------|----------------------| | Production deploy | Could introduce regressions, break APIs, cause OOM | | Cron job created or modified | May never fire, may timeout, may silently fail | | Database migration | Could break queries, lose data, lock tables | | Infrastructure config change | DNS propagation, SSL, rate limits, IAM changes | | Bug fix deployed | The fix might not actually fix the bug | | Timeout/resource increase | The new limit might still be insufficient | | Credential rotation | Services using old creds will break | | New integration/webhook | The other side might not be configured correctly | | Data pipeline run | Could produce partial results, wrong counts, stale data | | Backfill or batch job | Could OOM, timeout, or process wrong date range | If in doubt, register it. A false-positive follow-up costs 30 seconds to verify and close. A missed failure costs hours of debugging and potential data loss. ### What Does NOT Get Tracked - Pure code changes that haven't been deployed yet (track when deployed) - Discussions, plans, decisions (not actions with outcomes) - Items with no verifiable check command (if you can't verify it, rethink the action) --- ## Lifecycle of an Item ``` ACTION | v Register in FOLLOWUPS.md (immediate, same session) | v Check runs (manually, at session start, or via external automation) | +-- PASS → move to DONE with timestamp and evidence | +-- FAIL → move to FAILED, alert Guilherme, create remediation item | +-- OVERDUE (>2x deadline, no check) → escalate as P0 alert ``` ### 1. Register (immediate) The moment you take an action with a future outcome, add the item to FOLLOWUPS.md under `## PENDING`. Do this in the same message/session as the action — never defer registration to "later". If the action is a deploy: ```markdown ### Culkin Deploy #251 — Journey Grid v3 (Culkin) — 2026-03-22 - **Status:** PENDING - **Check:** `curl -sf "https://culkin.mygri.com/api/health" -H "X-API-Key: $CULKIN_API_KEY" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('status','FAIL'))"` - **Expected:** status=ok - **Deadline:** 2026-03-22 15:30 UTC - **On failure:** Check Vercel deploy logs, rollback to #250 if broken - **Priority:** P1 - **Origin:** commit abc1234, deploy triggered via `git push origin main` ``` If the action is a cron: ```markdown ### Google Ads Sync timeout increase (Senna) — 2026-03-22 - **Status:** PENDING - **Check:** `openclaw cron list 2>&1 | grep ads` - **Expected:** status=ok after next Sunday run - **Deadline:** 2026-03-29 10:00 UTC - **On failure:** Run manually with --timeout 7200 and check for infinite loops in sync script - **Priority:** P1 - **Origin:** cron timeout changed from 30s to 3600s ``` If the action is a data pipeline: ```markdown ### platform_members_matches full sync (Culkin) — 2026-03-22 - **Status:** PENDING - **Check:** `python3 -c "from google.cloud import bigquery; c=bigquery.Client(project='gri-culkin'); r=list(c.query('SELECT COUNT(*) as n FROM gri_raw.platform_members_matches').result()); print(r[0].n)"` - **Expected:** ~10.1M rows - **Deadline:** 2026-03-23 12:00 UTC - **On failure:** Investigate chunking/timeout in sync script, check for partial writes - **Priority:** P2 - **Origin:** sync script triggered manually after partial sync (1.92M vs 10.1M) ``` ### 2. Verify When checking an item (at session start, on request, or when the deadline arrives): 1. Run the `Check` command 2. Compare output against `Expected` 3. Update the item: **If PASS:** ```markdown ### Culkin Deploy #251 — Journey Grid v3 (Culkin) — 2026-03-22 — DONE - **History:** - 2026-03-22 15:32 — PASS: HTTP 200, status=ok ``` Move the item to the `## DONE` section. **If FAIL:** ```markdown ### Culkin Deploy #251 — Journey Grid v3 (Culkin) — 2026-03-22 - **Status:** FAILED - **History:** - 2026-03-22 15:32 — FAIL: HTTP 502, Bad Gateway ``` Move to `## FAILED`. Alert Guilherme immediately with the full context. If the `On failure` action is clear (e.g., rollback), propose executing it. ### 3. Escalation Rules These are invariants, not suggestions: - **3 consecutive FAILs** on the same item → escalate priority by one level (P2 to P1, P1 to P0). - **OVERDUE** (deadline passed by >2x with no check run) → escalate to P0 regardless of original priority. Something is wrong with the monitoring itself. - **FAILED items are NEVER auto-removed.** Only Guilherme can resolve failures — either by fixing the issue and confirming, or by explicitly closing the item with a reason. ### 4. Resolution When Guilherme resolves a FAILED item (fixes the issue and confirms): - Move to `## DONE` with the resolution note in History - If the fix itself needs tracking (e.g., a new deploy to fix the failed one), register a new follow-up ### 5. Cleanup The agent handles cleanup during session start or when explicitly asked: - Items in `## DONE` older than 3 days → removed from FOLLOWUPS.md - Before removing, append a one-line summary to `ARCHIVE.md` for the permanent audit trail: ``` 2026-03-22 | DONE | Culkin Deploy #251 — Journey Grid v3 | PASS at 15:32 UTC ``` - Items in `## DONE` are kept in reverse chronological order (newest first) - Items in `## FAILED` are NEVER auto-removed --- ## Session Start Protocol Every time a new session starts with Guilherme, before doing anything else: 1. Read FOLLOWUPS.md 2. Run cleanup (archive DONE items older than 3 days) 3. Report any FAILED items first (these are blockers) 4. Report any PENDING items past their deadline (overdue) 5. For overdue items, offer to run the check right now 6. Summarize: "X pending, Y failed, Z resolved since last session" This takes 30 seconds and prevents the "I forgot about that cron from 3 days ago" problem. --- ## Reporting ### Daily Summary (generated at first session of the day) ``` ACCOUNTABILITY — 2026-03-22 ======================================== Pending: 12 (P0: 0, P1: 3, P2: 9) Overdue: 1 (Google Ads Sync — Senna) Failed: 0 Resolved: 4 (today) Oldest: 7d (platform_members_matches — Culkin) NEEDS ATTENTION: [P1] Google Ads Sync timeout — Senna — overdue by 0d (next check: Mar 29) [P1] platform_members_matches sync — Culkin — deadline: next Culkin session ``` ### Weekly Summary (Mondays or on request) Includes: - Total items created vs resolved - Average time-to-resolution by priority - Projects with most failures - Recurring items (same system failing repeatedly → indicates systemic issue) - Items that have been PENDING for >7 days (stale — need deadline review or closure) --- ## Anti-Patterns These are the exact failure modes from the March 2026 incidents. The skill exists to prevent each one: 1. **"Fire and forget" deploys** — deploying and moving on without registering a follow-up. The skill requires registration in the same session as the action. 2. **Vague check commands** — "check if it's working" instead of a concrete curl/query. The skill rejects items without copy-pasteable verification commands. 3. **Silent failures** — a cron fails but nobody notices for days. Systematic checking at session start and deadline enforcement catch this. 4. **Alert fatigue** — too many P0 alerts desensitize Guilherme. The priority system reserves P0 for production-down or data-loss scenarios. 5. **Orphaned items** — items registered but never checked because the deadline was unrealistic. The OVERDUE escalation flags these. 6. **Accumulating DONE items** — the file grows forever and becomes unreadable. Auto-cleanup with archival keeps it lean. --- ## Rules 1. Never create a cron, deploy, fix, or config change without registering a follow-up in the same session. 2. Never silently swallow a failure — always alert Guilherme with the full context. 3. Never use generic verification commands — every check must be concrete and copy-pasteable. 4. Never auto-remove FAILED items — only Guilherme can resolve failures. 5. Never skip the session start protocol — always check FOLLOWUPS.md before starting new work.

accountability

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

accountability

accountability

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement