incident-response-lifecycle

# Incident Response Lifecycle

Structured process management for network incidents from detection through post-incident review. This skill covers the organizational coordination layer: severity classification, escalation, role assignment, stakeholder communication, recovery coordination, and root cause analysis. It does not cover technical evidence collection, device forensics, or containment execution; use the incident-response-network skill for network-level evidence gathering and forensic analysis.

The procedure follows the operational lifecycle: detect and classify the incident, triage and escalate to the right people, coordinate the investigation across teams, manage communications to all audiences, drive resolution and recovery, then conduct a blameless post-incident review.

See `references/communication-templates.md` for notification templates by audience and severity level. See `references/rca-framework.md` for the 5-whys methodology, fishbone diagram guidance, and post-mortem document structure.

## When to Use

- **Service-affecting incident declared:** a P1 or P2 event requires formal incident management with role assignment and communications
- **Escalation decision needed:** determining who to notify at what severity level and when to engage vendor support or management
- **Multi-team coordination required:** the investigation spans network, security, application, and infrastructure teams that need a single command structure
- **Customer or regulatory notification required:** the incident has external communication obligations (SLA breach, data exposure, regulatory reporting)
- **Post-incident review facilitation:** scheduling, structuring, and running blameless post-mortems with 5-whys root cause analysis
- **Incident metrics reporting:** collecting MTTD, MTTI, MTTR, and recurrence data for continuous improvement

## Prerequisites

- **Incident management authority:** the person initiating this process must have authorization to declare incidents and assign roles within the organization
- **Contact directory:** current on-call rosters, escalation contacts for management, vendor TAC numbers, and regulatory notification contacts must be accessible
- **Communication channels:** bridge call infrastructure (conference line or collaboration tool), status page access, and email distribution lists for each stakeholder group must be established
- **Incident tracking system:** a ticketing system to record the incident, track actions, and maintain the timeline of events
- **Defined severity criteria:** organizational agreement on what constitutes P1 through P4 severity (see Threshold Tables below for a reference framework)

## Procedure

Follow these six steps in order, with one exception: Steps 3 and 4 run in parallel once roles are assigned, so investigation coordination and communication management proceed simultaneously. Each step references templates from `references/communication-templates.md` and methodology from `references/rca-framework.md` where applicable.

### Step 1: Detection and Classification

Classify the incident by severity, type, and scope to determine the appropriate response level.

**Severity assignment:** Apply the P1–P4 taxonomy from the Threshold Tables section. Base severity on the highest-impact criterion met. When multiple criteria apply at different levels, the highest governs.

**Incident type classification:** Categorize as outage (service unavailable), degradation (reduced capacity), security (unauthorized access or data exposure), or data loss (corruption or deletion).

**Scope determination:** Assess whether the incident affects a single device, a network segment, an entire site, or multiple sites. Scope drives staffing, communication breadth, and recovery complexity.

**Initial impact assessment:** Estimate affected user count, impacted services and their business criticality, data at risk, and revenue impact per hour. Record estimates in the incident ticket.

### Step 2: Triage and Escalation

Assign roles, notify stakeholders, and set response timeline expectations based on the severity classification from Step 1.

**Role assignment:** Every P1 or P2 incident needs four named roles:

- **Incident Commander (IC):** owns the incident end-to-end and makes escalation decisions
- **Technical Lead:** coordinates diagnostics and synthesizes findings
- **Communications Lead:** drafts stakeholder notifications and manages the status page
- **Scribe:** maintains the real-time timeline and records bridge call decisions

For P3, IC and Technical Lead may be combined. P4 uses normal operations workflows.

**Escalation matrix execution:** Notify by severity:

- **P1:** all four roles plus engineering management, VP/director on-call, vendor TAC if vendor equipment is involved, executive notification within 30 minutes
- **P2:** all four roles plus engineering management within 1 hour
- **P3:** Technical Lead plus team lead within 4 hours
- **P4:** assigned engineer via normal ticket queue

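
The "highest governs" rule from Step 1 and the escalation matrix above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: `Severity`, `classify`, and `ESCALATION` are hypothetical names, and the notification text simply restates the matrix in this step.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher severity, so P1 compares as the minimum."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


def classify(*criteria: Severity) -> Severity:
    """Return the overall severity: the highest-impact criterion governs."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    return min(criteria)


# Hypothetical lookup that restates the escalation matrix from Step 2.
ESCALATION = {
    Severity.P1: "all four roles, engineering management, VP/director on-call, "
                 "vendor TAC if vendor equipment is involved, executives within 30 minutes",
    Severity.P2: "all four roles plus engineering management within 1 hour",
    Severity.P3: "Technical Lead plus team lead within 4 hours",
    Severity.P4: "assigned engineer via normal ticket queue",
}

# Example: user impact rates P3, service impact P2, data risk P4.
overall = classify(Severity.P3, Severity.P2, Severity.P4)
print(overall.name, "->", ESCALATION[overall])
```

Modeling severity as an ordered enum keeps the rule a single `min` call, so adding a new criterion does not change the logic.
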
**Response timeline expectations:**

- **P1:** bridge in 15 minutes, first update in 30 minutes, then every 30 minutes
- **P2:** bridge in 30 minutes, first update in 1 hour, then every 2 hours
- **P3:** initial assessment in 4 hours, daily updates
- **P4:** acknowledgment within 1 business day

**Vendor engagement criteria:** Engage vendor TAC when the incident involves hardware failure or software defects requiring patches, or when internal triage has not identified root cause within the severity time window (30 minutes for P1, 2 hours for P2, per the Escalation Decision tree).

### Step 3: Investigation Coordination

Coordinate the technical investigation across teams and evidence sources. For network-level evidence collection (device state, routing tables, interface data, log retrieval), reference the incident-response-network skill. This step focuses on organizing the investigation, not executing forensic commands.

**Evidence collection tasking:** Assign team members to collect evidence from relevant domains: network devices (via incident-response-network procedures), application logs, infrastructure metrics, and security tooling alerts. Each assignee reports findings to the Technical Lead.

**Parallel investigation streams:** For complex incidents, run multiple investigation threads simultaneously. Common parallel tracks:

1. Symptom analysis: what is failing and for whom
2. Change correlation: what changed recently (deployments, config modifications, maintenance)
3. External factors: upstream provider issues, DDoS, DNS resolution failures

**Hypothesis tracking:** Maintain a running list of hypotheses with current status (investigating, confirmed, ruled out). Each hypothesis should have an owner and a validation method. Update the list on every bridge call.

**Timeline of events (ToE):** The Scribe maintains a running chronological log of when events occurred, when they were detected, what actions were taken, and what was discovered. The ToE becomes the foundation for the post-incident review in Step 6.

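
The hypothesis list and the timeline of events described above are simple enough to model as plain records. The sketch below is one possible shape, assuming Python dataclasses; every class, field, and sample entry (`IncidentLog`, `Hypothesis`, `TimelineEntry`, the events and owners) is hypothetical and only mirrors the fields named in this step.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled out"


@dataclass
class Hypothesis:
    statement: str
    owner: str
    validation_method: str
    status: Status = Status.INVESTIGATING


@dataclass
class TimelineEntry:
    time_utc: datetime
    event: str
    actor: str
    notes: str = ""


@dataclass
class IncidentLog:
    hypotheses: List[Hypothesis] = field(default_factory=list)
    timeline: List[TimelineEntry] = field(default_factory=list)

    def record(self, event: str, actor: str, notes: str = "",
               when: Optional[datetime] = None) -> None:
        """Append an event and keep the timeline in chronological order."""
        when = when or datetime.now(timezone.utc)
        self.timeline.append(TimelineEntry(when, event, actor, notes))
        self.timeline.sort(key=lambda entry: entry.time_utc)

    def open_hypotheses(self) -> List[Hypothesis]:
        """Hypotheses still under investigation, for review on each bridge call."""
        return [h for h in self.hypotheses if h.status is Status.INVESTIGATING]


# Illustrative entries only; events may be logged out of order.
log = IncidentLog()
t0 = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
log.record("Bridge call opened", "IC", when=t0 + timedelta(minutes=15))
log.record("Monitoring alert fired", "monitoring system", when=t0)
log.hypotheses.append(Hypothesis("Recent config change", "Technical Lead",
                                 "Compare against last known-good config"))
log.hypotheses.append(Hypothesis("Upstream provider issue", "network on-call",
                                 "Check provider status page", Status.RULED_OUT))
print([entry.event for entry in log.timeline])
print([h.statement for h in log.open_hypotheses()])
```

Sorting on insert lets the Scribe add late-discovered events without breaking the chronological order the post-incident review depends on.
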
**Subject matter expert engagement:** When the investigation stalls or enters an unfamiliar domain, escalate to specialists. Define a clear handoff: what has been tried, what data is available, and what specific question needs answering.

### Step 4: Communication Management

Manage stakeholder communications throughout the incident. Use the templates in `references/communication-templates.md` for consistent messaging across audiences.

**Stakeholder notification by audience:**

- **Executive summary:** business impact, estimated resolution, customer exposure; no technical detail
- **Technical detail:** root cause hypothesis, diagnostics, remediation plan; delivered on the bridge call
- **Customer-facing:** service impact, workaround if available, estimated resolution; delivered via the status page
- **Regulatory:** formal notification per compliance framework when required

**Status update cadence:** Follow the severity-based cadence from Step 2. Each update includes current status, progress since the last update, the next planned action, and a revised time-to-resolution estimate.

**Bridge call management:** The IC runs calls with a fixed agenda:

1. Technical status from the Tech Lead
2. Communication status from the Comms Lead
3. Hypothesis updates
4. Decisions needed
5. Action items with owners and deadlines

Keep calls focused and park side discussions as action items.

**External notification requirements:** Track regulatory reporting deadlines, law enforcement notification when criminal activity is suspected, customer SLA breach notification per contractual terms, and vendor escalation for ongoing support.

### Step 5: Resolution and Recovery

Drive service restoration through validated recovery steps with monitoring to confirm the fix holds.

**Recovery validation criteria:** Before declaring the incident resolved, confirm:

1. Service health checks return normal for all affected components
2. Monitoring dashboards show green for at least 15 minutes (P1) or 30 minutes (P2)
3. No new related alerts appear during observation
4. Affected users confirm restoration (sample check for large populations)

**Phased restoration:** For multi-layer network incidents, restore in order: core infrastructure → distribution layer → access layer → end-to-end verification. Verify each phase before proceeding. Do not restore all layers simultaneously: cascading failures during recovery cost more than the extra time a phased approach takes.

**Back-out plan execution:** If the fix causes new issues, execute the pre-defined rollback. Every remediation action should have a documented rollback method before execution.

**Enhanced monitoring period:** Maintain heightened monitoring after resolution: P1 for 24 hours, P2 for 12 hours, P3 through the next business day. During this period, apply the alert thresholds from the Enhanced Monitoring Duration table (reduced for P1 and P2), keep on-call actively watching the affected systems, and re-escalate immediately if symptoms recur.

**Incident closure:** Send a closure notification to all stakeholders (template in `references/communication-templates.md`). Update the ticket with the resolution summary, total duration, and final impact. Schedule the post-incident review.

### Step 6: Post-Incident Review

Conduct a blameless post-incident review to identify root cause, contributing factors, and improvement actions. See `references/rca-framework.md` for the full methodology.

**Scheduling:** Hold the post-mortem within 72 hours of incident resolution while details are fresh. Invite all incident participants plus relevant stakeholders. Send the invitation using the template in `references/communication-templates.md`.

**5-whys root cause analysis:** Apply iteratively: for each "why" answer, ask "why" again until reaching a systemic root cause (typically 3–5 iterations).
See `references/rca-framework.md` for worked examples and facilitation guidance.

**Contributing factor categorization:** Classify each contributing factor as process (missing runbook, unclear escalation path), people (training gap, staffing shortage), or technology (monitoring gap, single point of failure, software defect). This categorization guides the type of remediation action needed.

**Action item classification:** Assign each action item one of four dispositions: fix (eliminate the root cause), mitigate (reduce likelihood or impact), accept (risk is within tolerance, document rationale), or transfer (assign to another team or vendor). Every fix or mitigate action must have an owner, due date, and verification method.

**Incident metrics:** For each incident, record time to detect, time to investigate, time to resolve, total incident duration, number of customers affected, and whether this is a recurrence of a previous incident. Averaged across incidents, the first three yield Mean Time to Detect (MTTD), Mean Time to Investigate (MTTI), and Mean Time to Resolve (MTTR). Track these metrics over time to measure improvement trends.

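
As a worked example of the metrics arithmetic, the sketch below derives per-incident intervals and averages them, using the interval definitions from the METRICS section of the Report Template (occurrence to detection, detection to root cause identified, detection to resolution). The `IncidentTimes` class, the `mean_minutes` helper, and the timestamps are all hypothetical and illustrative, not real incident data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimes:
    occurred: datetime
    detected: datetime
    root_cause_identified: datetime
    resolved: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.occurred

    @property
    def time_to_investigate(self) -> timedelta:
        return self.root_cause_identified - self.detected

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.detected


def mean_minutes(durations) -> float:
    """Average an iterable of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# Illustrative sample data only.
incidents = [
    IncidentTimes(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
                  datetime(2024, 1, 1, 10, 40), datetime(2024, 1, 1, 11, 10)),
    IncidentTimes(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 20),
                  datetime(2024, 1, 2, 15, 10), datetime(2024, 1, 2, 15, 40)),
]

mttd = mean_minutes(i.time_to_detect for i in incidents)
mtti = mean_minutes(i.time_to_investigate for i in incidents)
mttr = mean_minutes(i.time_to_resolve for i in incidents)
print(f"MTTD={mttd:.0f} min, MTTI={mtti:.0f} min, MTTR={mttr:.0f} min")
# prints "MTTD=15 min, MTTI=40 min, MTTR=70 min"
```
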
## Threshold Tables

### Severity Classification Matrix

| Severity | User Impact | Service Impact | Data Risk | Response SLA |
|----------|-------------|----------------|-----------|--------------|
| **P1 Critical** | >50% of users or all VIP users | Complete outage of revenue-generating service | Confirmed data breach or loss | Bridge in 15 min, updates every 30 min |
| **P2 High** | 10–50% of users affected | Major degradation or redundancy loss on critical path | Suspected data exposure | Bridge in 30 min, updates every 2 hr |
| **P3 Medium** | <10% of users, workaround exists | Partial degradation, non-critical service | No data risk identified | Assessment in 4 hr, updates daily |
| **P4 Low** | Minimal or no user impact | Cosmetic, non-production, or fully redundant | None | Ack in 1 business day |

### Escalation and Role Matrix

| Severity | Incident Commander | Technical Lead | Comms Lead | Scribe | Management | Executive |
|----------|--------------------|----------------|------------|--------|------------|-----------|
| **P1** | Required | Required | Required | Required | Immediate | Within 30 min |
| **P2** | Required | Required | Required | Required | Within 1 hr | If SLA breached |
| **P3** | Combined with Tech Lead | Required | Optional | No | Within 4 hr | No |
| **P4** | No | Assigned engineer | No | No | Normal reporting | No |

### Enhanced Monitoring Duration

| Severity | Monitoring Period | Alert Threshold | Re-escalation Trigger |
|----------|-------------------|-----------------|-----------------------|
| **P1** | 24 hours | Reduced by 20% | Any recurrence symptom |
| **P2** | 12 hours | Reduced by 10% | Same failure signature |
| **P3** | Next business day | Normal thresholds | Identical alert |
| **P4** | None | Normal | Normal process |

## Decision Trees

### Incident Severity Assignment

```
Event detected or reported
├── Is the service completely unavailable?
│   ├── Yes → Is it a revenue-generating or safety-critical service?
│   │   ├── Yes → P1 Critical
│   │   └── No → P2 High
│   └── No → Service is partially available
│       ├── Are more than 10% of users affected without workaround?
│       │   ├── Yes → P2 High
│       │   └── No → Is there a workaround available?
│       │       ├── Yes → P3 Medium
│       │       └── No, but fewer than 10% of users → P3 Medium
│       └── Is this a non-production or cosmetic issue?
│           └── Yes → P4 Low
├── Is there confirmed or suspected data exposure?
│   ├── Confirmed breach → P1 Critical (regardless of service status)
│   └── Suspected exposure → P2 High minimum
└── Has redundancy been lost on a critical path?
    ├── Yes, no failover remaining → P2 High
    └── Yes, failover still available → P3 Medium
```

### Escalation Decision

```
Severity assigned
├── P1 or P2?
│   ├── Yes → Assign all four roles immediately
│   │   ├── Is vendor equipment involved in the failure?
│   │   │   ├── Yes → Open vendor TAC case immediately
│   │   │   └── No → Internal investigation first
│   │   └── Has root cause been identified within time window?
│   │       ├── P1: not identified within 30 min → Escalate to next tier
│   │       └── P2: not identified within 2 hr → Escalate to next tier
│   └── P3 or P4?
│       ├── P3 → Assign Technical Lead, monitor for escalation
│       │   └── Impact worsening? → Re-classify severity upward
│       └── P4 → Normal ticket queue, no escalation
└── At any point: if scope expands beyond initial classification
    └── Re-evaluate severity from Step 1, escalate if needed
```

## Report Template

```
INCIDENT REPORT
=====================================
Incident ID: [ticket/tracking number]
Severity: [P1/P2/P3/P4]
Incident Commander: [name]
Duration: [detection time] to [resolution time] ([total hours])
Status: [Resolved / Monitoring / Under Review]

IMPACT SUMMARY:
Users Affected: [count or percentage]
Services Affected: [list of impacted services]
Revenue Impact: [estimated or confirmed]
Data Impact: [none / suspected / confirmed, with description]

TIMELINE OF EVENTS:
| # | Time (UTC) | Event | Actor | Notes |
|---|-----------|-------|-------|-------|
| 1 | [time] | [event description] | [person/system] | [context] |

ROOT CAUSE:
Category: [Process / People / Technology]
Root Cause: [description from 5-whys analysis]
Contributing Factors:
- [factor 1 (category)]
- [factor 2 (category)]

RESOLUTION:
Fix Applied: [description of what resolved the incident]
Validated By: [how resolution was confirmed]
Back-out Available: [yes/no, with description]

METRICS:
MTTD: [time from occurrence to detection]
MTTI: [time from detection to root cause identified]
MTTR: [time from detection to resolution]
Recurrence: [yes/no, with reference to prior incident if yes]

ACTION ITEMS:
| # | Action | Type | Owner | Due Date | Status |
|---|--------|------|-------|----------|--------|
| 1 | [action] | [Fix/Mitigate/Accept/Transfer] | [name] | [date] | [status] |

POST-MORTEM STATUS:
Scheduled: [date/time or "pending"]
Attendees: [roles invited]
Document Location: [link to post-mortem document]
```

## Troubleshooting

### Severity Disagreement Between Teams

**Symptom:** Teams classify the same incident at different severity levels, causing confusion about response urgency.

**Resolution:** The IC makes the final determination using the Threshold Tables criteria.
The highest applicable severity governs. Document the rationale in the ticket. If the IC is not yet assigned, the first responder sets the initial severity and the IC may adjust it later.

### Escalation Fatigue and Alert Noise

**Symptom:** Frequent P1/P2 declarations for issues that resolve quickly, eroding trust in severity classification.

**Resolution:** Review severity criteria quarterly. Track the false-positive rate (incidents downgraded after initial classification). If the P1 downgrade rate exceeds 30%, tighten the P1 criteria. Check that P3/P4-level issues are not being declared at a higher severity.

### Post-Mortem Action Items Not Completed

**Symptom:** Action items accumulate but are not completed, leading to recurring incidents from known causes.

**Resolution:** Assign every action item an owner and due date at the review. Track completion in the incident system, not in separate documents. Review open items in weekly standups. Escalate overdue items to management and report completion rates alongside MTTD/MTTR.

### Communication Gaps During Extended Incidents

**Symptom:** Status updates become infrequent during long incidents (>4 hours), leaving stakeholders uninformed.

**Resolution:** The Communications Lead maintains the cadence regardless of investigation progress. If there are no new findings, state that explicitly in the update. For incidents exceeding 8 hours, rotate the Comms Lead role to prevent fatigue.

### Incident Recurrence After Resolution

**Symptom:** The same incident recurs after being marked resolved.

**Resolution:** Check whether the prior post-mortem action items were completed. If they were, the root cause analysis was incomplete, so reconvene with a broader scope. If they were not, escalate the completion failure. Tag the new incident as a recurrence and increase severity by one level (for example, P3 becomes P2) to reflect accumulated impact.

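
Two of the numeric rules in this Troubleshooting section, the 30% P1 downgrade threshold and the one-level severity increase for recurrences, can be checked mechanically. The sketch below assumes severities are stored as the strings "P1" through "P4"; the function names and the sample classifications are hypothetical.

```python
SEVERITY_ORDER = ["P1", "P2", "P3", "P4"]  # most severe first


def p1_downgrade_rate(initial: list, final: list) -> float:
    """Fraction of incidents declared P1 that ended at a lower severity."""
    p1_finals = [f for i, f in zip(initial, final) if i == "P1"]
    if not p1_finals:
        return 0.0
    return sum(f != "P1" for f in p1_finals) / len(p1_finals)


def bump_for_recurrence(severity: str) -> str:
    """Raise severity by one level for a recurrence; P1 stays P1."""
    index = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(index - 1, 0)]


# Illustrative quarter: initial vs. final classification per incident.
initial = ["P1", "P1", "P1", "P2", "P1"]
final = ["P1", "P2", "P3", "P2", "P1"]

rate = p1_downgrade_rate(initial, final)
if rate > 0.30:
    print(f"P1 downgrade rate {rate:.0%}: tighten P1 criteria")

print(bump_for_recurrence("P3"))
```
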
