etl-design
# ETL Design
ETL is **correctness under change**: schema drift, partial loads, retries, and reconciliation with upstream systems.
## When to Offer This Workflow
**Trigger conditions:**
- Batch loads into warehouse or data lake
- Choosing between CDC, snapshots, and incremental watermarks
- Missing rows, duplicates, or inconsistent aggregates downstream
**Initial offer:**
Use **six stages**: (1) source contract, (2) extract strategy, (3) transform rules, (4) load & dedupe, (5) validation, (6) operations & backfill). Confirm batch window and SLA.
---
## Stage 1: Source Contract
**Goal:** Document schema, primary keys, change indicators (`updated_at`, CDC log position), and access constraints (rate limits, read replicas).
---
## Stage 2: Extract Strategy
**Goal:** Full dump vs incremental watermark vs CDC—trade freshness, source load, and complexity.
### Practices
- CDC for large sources; snapshots for small or infrequent tables
---
## Stage 3: Transform Rules
**Goal:** Deterministic transforms; surrogate keys; business rules versioned; handling of deletes (tombstones vs hard deletes).
---
## Stage 4: Load & Dedupe
**Goal:** Upsert keys; partitions; rerunnable jobs with same batch id producing the same outcome (idempotent load).
---
## Stage 5: Validation
**Goal:** Row counts, checksums, key uniqueness, referential checks; alert on threshold breaches.
---
## Stage 6: Operations & Backfill
**Goal:** Replay by date range; monitor lag; dead-letter or quarantine bad rows with reason codes.
---
## Final Review Checklist
- [ ] Source contract and keys documented
- [ ] Extract mode matches SLA and source constraints
- [ ] Transforms deterministic and versioned
- [ ] Idempotent load strategy
- [ ] Validation and reconciliation defined
## Tips for Effective Guidance
- Plan for late-arriving facts and slowly changing dimensions in analytics paths.
- Pair with **data-pipelines** for orchestration and monitoring.
## Handling Deviations
- Near-real-time: document micro-batch or streaming semantics separately.
标签
skill
ai