title: Production incident response — SOP
project: business-ops-template
tags:
- playbook
- incident
- sre
- on-call
date: 2026-04-07
Production incident response — SOP
Applies to: Customer-facing production services | Owner: Platform lead
Severity guide (use one)
- SEV1: Full outage or data loss risk; wake secondary on-call.
- SEV2: Major degradation; work business hours unless revenue-critical.
- SEV3: Minor issue with workaround; ticket and batch fix.
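The severity guide above could be encoded so that tooling applies it consistently. The following is a minimal sketch, not a prescribed implementation: `Severity`, `Policy`, and `policy_for` are hypothetical names, and treating SEV1 as always worked outside business hours is an assumption inferred from "wake secondary on-call".

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Policy:
    wake_secondary: bool    # page the secondary on-call immediately
    after_hours_work: bool  # work the incident outside business hours
    handling: str           # short description of the expected handling


def policy_for(severity: Severity, revenue_critical: bool = False) -> Policy:
    """Map a severity to the handling described in the severity guide."""
    if severity is Severity.SEV1:
        # Assumption: a SEV1 is worked immediately, whatever the hour.
        return Policy(True, True, "full outage or data loss risk")
    if severity is Severity.SEV2:
        # SEV2 stays in business hours unless it is revenue-critical.
        return Policy(False, revenue_critical, "major degradation")
    return Policy(False, False, "ticket and batch fix")
```

Keeping the mapping in one function means a change to the guide is made in a single place rather than in each alerting rule.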
Immediate steps (first 15 minutes)
- Declare incident in the status tool; set severity; post customer-facing banner only if user-visible.
- Assign Incident Commander (IC) and scribe; IC drives, scribe timestamps actions.
- Capture symptoms: error rates, regions, last deploy, dependency status—link dashboards in the incident doc.
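The first-15-minutes steps above map naturally onto a small incident record that the scribe appends to. This is a sketch under stated assumptions: the `Incident` class, its field names, and the example names and title are hypothetical, and a real status tool will have its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Incident:
    title: str
    severity: str                 # "SEV1", "SEV2", or "SEV3"
    commander: str                # Incident Commander (IC), drives the response
    scribe: str                   # timestamps actions
    user_visible: bool = False
    dashboards: List[str] = field(default_factory=list)
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, action: str, at: Optional[datetime] = None) -> None:
        """Scribe entry: record an action with a UTC timestamp."""
        self.timeline.append((at or datetime.now(timezone.utc), action))

    def needs_banner(self) -> bool:
        """Post a customer-facing banner only when the issue is user-visible."""
        return self.user_visible


# Hypothetical usage: names and title are placeholders.
incident = Incident("Checkout errors", "SEV1", commander="alice",
                    scribe="bob", user_visible=True)
incident.log("Declared incident; severity set to SEV1")
incident.log("Symptoms captured: error rates, regions, last deploy, dependency status")
```

Storing timestamps in UTC avoids ambiguity when responders are spread across time zones.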
Stabilize before root cause
- Prefer a rollback or switching the feature flag off if the incident correlates with a recent change; avoid speculative hotfixes during SEV1.
Communication
- Internal updates every 30 minutes for SEV1 until resolved.
- Customer comms go through support lead; no individual engineer posts externally.
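The 30-minute SEV1 cadence above can be checked mechanically, for example by a reminder bot. The sketch below assumes hypothetical helper names (`next_internal_update`, `is_update_overdue`). It returns `None` for other severities because the SOP states no cadence for them.

```python
from datetime import datetime, timedelta
from typing import Optional

SEV1_UPDATE_INTERVAL = timedelta(minutes=30)


def next_internal_update(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next internal update is due, or None if no cadence is defined."""
    if severity == "SEV1":
        return last_update + SEV1_UPDATE_INTERVAL
    return None


def is_update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True when a cadence applies and the due time has passed."""
    due = next_internal_update(severity, last_update)
    return due is not None and now >= due
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.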
After resolution & escalation
- Post summary (window, cause, remediation, follow-ups in 48h).
- Book a blameless retro within five business days, with owned actions.
- If the IC is absent, the secondary runs IC.
- Loop in legal/comms on any suspicion of data exposure.

Replace tool names and contacts for your org.