MuseHub On-Call Runbook
On-call contact: gabriel
Alert channel: SNS musehub-alerts → SMS to gabriel's phone + email
Alert script: deploy/cloudwatch-alerts.sh
Alert inventory
| Alarm name | Threshold | Action |
|---|---|---|
| `musehub-5xx-rate-high` | 5xx rate > 1% over 2 consecutive 1-min windows | Investigate app logs + DB |
| `musehub-p99-latency-high` | p99 latency > 2 s over 2/3 1-min windows | Check slow queries, CPU, memory |
| `musehub-disk-high` | Disk > 80% | Prune object store; archive or expand volume |
| `musehub-db-connections-high` | DB connections > 90% of max_connections | Restart connection pool; scale PgBouncer |
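The two request alarms use different evaluation rules: the 5xx alarm needs 2 consecutive breaching windows, while the latency alarm fires when any 2 of the last 3 windows breach (CloudWatch calls this "M out of N" and exposes it through the `--datapoints-to-alarm` and `--evaluation-periods` options of `aws cloudwatch put-metric-alarm`). The sketch below only illustrates the rule; `breaches_m_of_n` is a hypothetical helper, and the real evaluation is done by CloudWatch.

```shell
# Hypothetical helper: succeeds when at least M of the given datapoints
# exceed THRESHOLD. Illustration only; CloudWatch evaluates the real alarm.
# usage: breaches_m_of_n THRESHOLD M VALUE...
breaches_m_of_n() (
  threshold=$1
  needed=$2
  shift 2
  # awk handles the floating-point comparison that plain sh cannot.
  printf '%s\n' "$@" | awk -v t="$threshold" -v m="$needed" '
    $1 + 0 > t + 0 { hits++ }
    END { if (hits + 0 >= m + 0) exit 0; exit 1 }'
)
```

For example, `breaches_m_of_n 2 2 2.4 1.1 2.7` succeeds (two of three p99 samples are above 2 s), whereas `breaches_m_of_n 2 2 2.4 1.1 1.9` fails.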
Setting up SMS alerting
# Provision alerts (run once; re-runnable)
ALERT_EMAIL="[email protected]" \
ALERT_PHONE="+1XXXXXXXXXX" \
AWS_REGION=eu-west-1 \
INSTANCE_ID=i-XXXXXXXXXXXXXXXXX \
bash deploy/cloudwatch-alerts.sh
AWS will send a confirmation email to ALERT_EMAIL — click the link to activate.
SMS subscription is immediate (no confirmation required for US numbers).
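To confirm that both subscriptions are active after provisioning, the topic's subscriptions can be listed. This is a minimal sketch: the function names are hypothetical, and the account ID is a placeholder you must supply. An email subscription that has not been confirmed yet reports `PendingConfirmation` in place of a subscription ARN.

```shell
# Build the topic ARN. ACCOUNT_ID is a placeholder for the real AWS account.
# usage: sns_topic_arn REGION ACCOUNT_ID TOPIC_NAME
sns_topic_arn() {
  printf 'arn:aws:sns:%s:%s:%s\n' "$1" "$2" "$3"
}

# List protocol, endpoint, and subscription ARN for musehub-alerts.
# usage: list_alert_subscriptions REGION ACCOUNT_ID
list_alert_subscriptions() {
  aws sns list-subscriptions-by-topic \
    --topic-arn "$(sns_topic_arn "$1" "$2" musehub-alerts)" \
    --region "$1" \
    --query 'Subscriptions[].[Protocol,Endpoint,SubscriptionArn]' \
    --output text
}
```

Usage: `list_alert_subscriptions eu-west-1 ACCOUNT_ID` should print one line per subscription (email and SMS).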
PagerDuty integration (optional escalation)
To escalate automatically when gabriel is unreachable after 15 minutes:
- Create a PagerDuty service with an "Amazon SNS" integration.
- Copy the PagerDuty HTTPS endpoint.
- Subscribe it to the SNS topic:
aws sns subscribe \
--topic-arn arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts \
--protocol https \
--notification-endpoint https://events.pagerduty.com/integration/XXXXXX/enqueue
- Set an escalation policy: page gabriel's mobile → page backup → open incident.
Log access
Logs ship to CloudWatch Logs group /musehub/app (30-day hot retention).
# Tail live logs
aws logs tail /musehub/app --follow --region eu-west-1
# Search for 5xx errors in the last hour
aws logs filter-log-events \
--log-group-name /musehub/app \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern '{ $.status >= 500 }' \
--region eu-west-1
# Search by request_id
aws logs filter-log-events \
--log-group-name /musehub/app \
--filter-pattern '{ $.request_id = "REQUEST_ID_HERE" }' \
--region eu-west-1
# Search by user_id
aws logs filter-log-events \
--log-group-name /musehub/app \
--filter-pattern '{ $.user_id = "gabriel" }' \
--region eu-west-1
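The `date -d '1 hour ago'` form above is specific to GNU `date` and fails on macOS/BSD. The sketch below computes the millisecond timestamp with shell arithmetic instead and builds the JSON filter pattern from a field name and value. Both helper names are hypothetical.

```shell
# Print the epoch time in milliseconds for N seconds ago.
# Uses shell arithmetic instead of the GNU-only `date -d`.
# usage: epoch_ms_ago SECONDS
epoch_ms_ago() {
  echo $(( ( $(date +%s) - $1 ) * 1000 ))
}

# Build a CloudWatch Logs JSON filter pattern matching one string field.
# usage: field_filter FIELD VALUE
field_filter() {
  printf '{ $.%s = "%s" }\n' "$1" "$2"
}

# Example (commented out because it needs AWS credentials):
# aws logs filter-log-events \
#   --log-group-name /musehub/app \
#   --start-time "$(epoch_ms_ago 3600)" \
#   --filter-pattern "$(field_filter request_id REQUEST_ID_HERE)" \
#   --region eu-west-1
```

Passing a `--start-time` to the request_id and user_id searches also keeps them from scanning the full 30 days of hot logs.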
Cold log storage (1-year retention)
Hot logs (30 days) live in CloudWatch Logs. Cold logs are exported to S3:
Export job (run monthly or set up Kinesis Firehose subscription):
aws logs create-export-task \
--log-group-name /musehub/app \
--from $(date -d '31 days ago' +%s)000 \
--to $(date -d '1 day ago' +%s)000 \
--destination musehub-logs-cold \
--destination-prefix musehub/app \
--region eu-west-1
S3 lifecycle policy on `musehub-logs-cold` bucket:
- Transition to Glacier after 30 days
- Expire (delete) after 365 days
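The lifecycle policy above could be expressed as the following configuration. This is a sketch under assumptions: the rule ID and the `musehub/app/` prefix filter are illustrative choices, not taken from the deployed bucket.

```shell
# Print a lifecycle configuration matching the policy described above.
# The rule ID and the "musehub/app/" prefix filter are assumptions.
lifecycle_policy_json() {
  cat <<'EOF'
{
  "Rules": [
    {
      "ID": "musehub-logs-cold-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "musehub/app/" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF
}

# To apply (commented out because it needs AWS credentials):
# lifecycle_policy_json > /tmp/lifecycle.json
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket musehub-logs-cold \
#   --lifecycle-configuration file:///tmp/lifecycle.json
```

Note that `put-bucket-lifecycle-configuration` replaces the bucket's entire existing lifecycle configuration, so check for other rules first with `aws s3api get-bucket-lifecycle-configuration --bucket musehub-logs-cold`.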
Common incidents
High 5xx rate
- Check recent deployments: `muse -C ~/musehub log --oneline -20`
- Find 5xx requests: filter logs for `$.status >= 500`
- Look at `exc_info` field for stack traces
- If DB-related: check connections alarm + `select count(*) from pg_stat_activity`
- Rollback: `bash deploy/deploy.sh` deploys the image that is already built, so ensure the previous image tag is still present on the host before rolling back
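To judge whether the 5xx rate is still above the 1% alarm threshold while investigating, the rate can be computed from recent log lines. This is a rough sketch: `error_rate` is a hypothetical helper, and it assumes each log line contains a JSON object with a numeric `status` field (as the `$.status` filter patterns imply).

```shell
# Read log lines on stdin and print "errors total pct".
# Assumes each line carries a JSON object with a numeric "status" field.
error_rate() {
  LC_ALL=C awk '
    match($0, /"status"[ ]*:[ ]*[0-9]+/) {
      total++
      code = substr($0, RSTART, RLENGTH)
      sub(/.*:[ ]*/, "", code)        # keep only the digits
      if (code + 0 >= 500) errors++
    }
    END {
      pct = total ? 100 * errors / total : 0
      printf "%d %d %.2f\n", errors, total, pct
    }'
}
```

Usage, assuming the AWS CLI v2 `logs tail` options `--since` and `--format`: `aws logs tail /musehub/app --since 10m --format short --region eu-west-1 | error_rate`.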
High p99 latency
- Filter logs for `$.duration_ms > 2000` to find slow endpoints
- Check slow query log (threshold: 100ms, configured in `Settings.slow_query_threshold_ms`)
- Check RSS: `docker stats musehub --no-stream`
- Check DB connections: `select count(*), state from pg_stat_activity group by state`
Disk > 80%
- Identify largest directories: `du -sh /data/musehub/objects/*/ | sort -rh | head -20`
- Check for objects past soft-delete window: `select count(*) from musehub_objects where deleted_at < now() - interval '30 days'`
- Run hard-delete job manually if needed
- If disk keeps growing: expand the EBS volume via AWS Console + `sudo growpart`, then resize the filesystem (`resize2fs` for ext4, `xfs_growfs` for XFS)
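Before and after pruning, the current usage can be compared against the 80% alarm threshold from the host itself. A minimal sketch follows; both function names are hypothetical, and the mount point to check is an assumption based on the `/data/musehub/objects` path above.

```shell
# Print the used percentage (integer) of the filesystem holding PATH.
# usage: disk_used_pct PATH
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is above LIMIT percent.
# usage: disk_over PATH LIMIT
disk_over() {
  [ "$(disk_used_pct "$1")" -gt "$2" ]
}
```

Usage: `disk_over /data/musehub 80 && echo "still above threshold"`.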
DB connections > 90%
- `select count(*), state, wait_event_type from pg_stat_activity group by state, wait_event_type`
- Kill idle connections: `select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and query_start < now() - interval '5 minutes'`
- Restart the musehub container to flush its async pool: `sudo docker restart musehub-blue` (or `-green`)
- If recurrent: decrease `db_pool_timeout` or add PgBouncer in front of PostgreSQL
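To see how close the database is to the 90% alarm threshold after killing idle connections, the current count can be expressed as a percentage of `max_connections`. This is a sketch under assumptions: the function names are hypothetical, and `db_conn_pct` assumes `psql` can reach the database through the standard `PG*` environment variables.

```shell
# Print CURRENT as an integer percentage of MAX (rounded down).
# usage: conn_pct CURRENT MAX
conn_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Hypothetical wrapper: assumes psql connects via PGHOST, PGUSER, PGDATABASE.
db_conn_pct() (
  current=$(psql -At -c 'select count(*) from pg_stat_activity')
  max=$(psql -At -c 'show max_connections')
  conn_pct "$current" "$max"
)
```

Usage: `[ "$(db_conn_pct)" -gt 90 ] && echo "still above threshold"`.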