---
title: Production incident response — SOP
project: business-ops-template
tags:
  - playbook
  - incident
  - sre
  - on-call
date: 2026-04-07
---

# Production incident response — SOP

**Applies to:** Customer-facing production services | **Owner:** Platform lead

## Severity guide (use one)

- **SEV1:** Full outage or data loss risk; wake secondary on-call.
- **SEV2:** Major degradation; work business hours unless revenue-critical.
- **SEV3:** Minor issue with workaround; ticket and batch fix.

## Immediate steps (first 15 minutes)

1. **Declare incident** in the status tool; set severity; post customer-facing banner only if user-visible.
2. Assign **Incident Commander (IC)** and **scribe**; IC drives, scribe timestamps actions.
3. Capture **symptoms**: error rates, regions, last deploy, dependency status; link dashboards in the incident doc.

## Stabilize before root cause

- Prefer **rollback** or feature flag off if change correlated; avoid speculative hotfixes during SEV1.

## Communication

- Internal updates every **30 minutes** for SEV1 until resolved.
- Customer comms go through **support lead**; no individual engineer posts externally.

## After resolution & escalation

Post **summary** (window, cause, remediation, follow-ups in 48h); book a **blameless retro** within five business days with owned actions.

If IC is absent, secondary runs IC; loop legal/comms on **data exposure** suspicion.

Replace tool names and contacts for your org.
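
Teams that automate reminders may find it useful to encode the severity guide and the timing rules above as data. The following is a minimal sketch and not part of the SOP itself: `SeverityPolicy`, `POLICIES`, `next_internal_update`, and `retro_deadline` are hypothetical names, the SOP defines an internal update cadence only for SEV1 (so the other severities are left as `None`), and the retro calculation assumes a Monday to Friday work week with no public holidays.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """Hypothetical container for the rules the SOP attaches to a severity."""
    description: str
    wake_secondary: bool                        # SEV1 wakes the secondary on-call
    internal_update_every: Optional[timedelta]  # None: the SOP states no cadence


# Values taken from the severity guide and communication sections.
POLICIES = {
    "SEV1": SeverityPolicy("Full outage or data loss risk", True, timedelta(minutes=30)),
    "SEV2": SeverityPolicy("Major degradation", False, None),
    "SEV3": SeverityPolicy("Minor issue with workaround", False, None),
}


def next_internal_update(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next internal update is due, or None if no cadence is defined."""
    interval = POLICIES[severity].internal_update_every
    return None if interval is None else last_update + interval


def retro_deadline(resolved_on: date, business_days: int = 5) -> date:
    """Latest date for the blameless retro, counting business days after resolution.

    Assumption: weekends are skipped, public holidays are not.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

For example, under these assumptions an incident resolved on Tuesday 2026-04-07 gives `retro_deadline(date(2026, 4, 7)) == date(2026, 4, 14)`. Keeping the rules in one table means a change to the SOP (say, a cadence for SEV2) is a one-line edit instead of a code change in several places.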