MuseHub On-Call Runbook

On-call contact: gabriel
Alert channel: SNS musehub-alerts → SMS to gabriel's phone + email
Alert script: deploy/cloudwatch-alerts.sh

Alert inventory

Alarm name	Threshold	Action
`musehub-5xx-rate-high`	5xx rate > 1% over 2 consecutive 1-min windows	Investigate app logs + DB
`musehub-p99-latency-high`	p99 latency > 2 s in 2 of 3 1-min windows	Check slow queries, CPU, memory
`musehub-disk-high`	Disk > 80%	Prune object store; archive or expand volume
`musehub-db-connections-high`	DB connections > 90% of `max_connections`	Restart connection pool; scale PgBouncer
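
When a page arrives, it can help to see every alarm that is firing, not only the one that paged. The sketch below assumes that all alarm names share the `musehub-` prefix shown in the table and that `AWS_REGION` defaults to `eu-west-1`; the function name `firing_alarms` is illustrative and not part of deploy/cloudwatch-alerts.sh.

```shell
# List MuseHub alarms currently in the ALARM state, with the time they last changed state.
# Assumption: every alarm created by the alert script is named "musehub-...".
firing_alarms() {
  aws cloudwatch describe-alarms \
    --alarm-name-prefix musehub- \
    --state-value ALARM \
    --query 'MetricAlarms[].[AlarmName,StateUpdatedTimestamp]' \
    --output text \
    --region "${AWS_REGION:-eu-west-1}"
}

# Usage:
#   firing_alarms
```

Empty output means no alarm with that prefix is currently firing.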

Setting up SMS alerting

# Provision alerts (run once; re-runnable)
ALERT_EMAIL="[email protected]" \
ALERT_PHONE="+1XXXXXXXXXX" \
AWS_REGION=eu-west-1 \
INSTANCE_ID=i-XXXXXXXXXXXXXXXXX \
bash deploy/cloudwatch-alerts.sh

AWS will send a confirmation email to ALERT_EMAIL — click the link to activate.
SMS subscription is immediate (no confirmation required for US numbers).
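
To check that both subscriptions are active after provisioning, something like the following may be useful. It is a sketch: the function names are illustrative, and the topic ARN uses the same `ACCOUNT` placeholder as elsewhere in this runbook. An email subscription that has not been confirmed yet is reported with the subscription ARN `PendingConfirmation`.

```shell
# Show protocol, endpoint and subscription ARN for each subscriber of a topic.
# $1 is the topic ARN, e.g. arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts
list_alert_subscribers() {
  aws sns list-subscriptions-by-topic \
    --topic-arn "$1" \
    --query 'Subscriptions[].[Protocol,Endpoint,SubscriptionArn]' \
    --output text \
    --region "${AWS_REGION:-eu-west-1}"
}

# Publish a clearly labelled test message to every subscriber of the topic.
send_test_alert() {
  aws sns publish \
    --topic-arn "$1" \
    --subject "musehub-alerts test" \
    --message "Test notification from the on-call runbook; no action needed." \
    --region "${AWS_REGION:-eu-west-1}"
}

# Usage:
#   list_alert_subscribers arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts
#   send_test_alert        arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts
```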

PagerDuty integration (optional escalation)

To escalate automatically when gabriel is unreachable after 15 minutes, set up PagerDuty as follows:

Create a PagerDuty service with an "Amazon SNS" integration.
Copy the PagerDuty HTTPS endpoint.

Subscribe it to the SNS topic:

aws sns subscribe \
  --topic-arn arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts \
  --protocol https \
  --notification-endpoint https://events.pagerduty.com/integration/XXXXXX/enqueue

Set an escalation policy: page gabriel's mobile → page backup → open incident.

Log access

Logs ship to CloudWatch Logs group /musehub/app (30-day hot retention).

# Tail live logs
aws logs tail /musehub/app --follow --region eu-west-1

# Search for 5xx errors in the last hour
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '{ $.status >= 500 }' \
  --region eu-west-1

# Search by request_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.request_id = "REQUEST_ID_HERE" }' \
  --region eu-west-1

# Search by user_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.user_id = "gabriel" }' \
  --region eu-west-1
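
The `date -d '1 hour ago'` form above relies on GNU `date`. Where that is not available (for example on macOS), the start time can be computed with shell arithmetic instead. The helper names `ms_ago` and `find_request` below are illustrative, and bounding the request_id search to a time window is a suggestion rather than something the original commands do.

```shell
# Milliseconds since the epoch for a point N seconds in the past.
# Uses `date +%s` plus shell arithmetic, which works with both GNU and BSD date.
ms_ago() {
  echo $(( ($(date +%s) - $1) * 1000 ))
}

# Search for a request_id within the last N seconds (default: 3600).
#   $1 request id, $2 optional look-back window in seconds
find_request() {
  aws logs filter-log-events \
    --log-group-name /musehub/app \
    --start-time "$(ms_ago "${2:-3600}")" \
    --filter-pattern "{ \$.request_id = \"$1\" }" \
    --region "${AWS_REGION:-eu-west-1}"
}

# Usage:
#   find_request REQUEST_ID_HERE 900   # last 15 minutes
```

Limiting the window keeps the search from scanning the full 30 days of hot logs.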

Cold log storage (1-year retention)

Hot logs (30 days) live in CloudWatch Logs. Cold logs are exported to the S3 bucket musehub-logs-cold.

Export job (run monthly, or replace it with a Kinesis Data Firehose subscription for continuous delivery):

aws logs create-export-task \
  --log-group-name /musehub/app \
  --from $(date -d '31 days ago' +%s)000 \
  --to $(date -d '1 day ago' +%s)000 \
  --destination musehub-logs-cold \
  --destination-prefix musehub/app \
  --region eu-west-1
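
`create-export-task` returns a `taskId` and runs asynchronously, so it is worth confirming that the export finished. A minimal sketch, with illustrative function names and an assumed 30-second default poll interval:

```shell
# Print the status code of a CloudWatch Logs export task
# (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED or PENDING_CANCEL).
#   $1 is the taskId returned by `aws logs create-export-task`.
export_status() {
  aws logs describe-export-tasks \
    --task-id "$1" \
    --query 'exportTasks[0].status.code' \
    --output text \
    --region "${AWS_REGION:-eu-west-1}"
}

# Poll until the task is no longer pending or running.
#   $1 taskId, $2 optional poll interval in seconds
wait_for_export() {
  while :; do
    status="$(export_status "$1")"
    echo "export $1: $status"
    case "$status" in
      PENDING|PENDING_CANCEL|RUNNING) sleep "${2:-30}" ;;
      *) break ;;
    esac
  done
}

# Usage:
#   wait_for_export TASK_ID_HERE
```

CloudWatch Logs allows only one active export task per account at a time, so a second export started while one is still running is expected to be rejected.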

S3 lifecycle policy on musehub-logs-cold bucket:
- Transition to Glacier after 30 days
- Expire (delete) after 365 days
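
The two lifecycle rules above could be applied from the CLI roughly as follows. This is a sketch: the rule ID, the prefix filter (taken from the `--destination-prefix` used by the export job) and the temporary file path are assumptions, not a copy of the bucket's actual configuration.

```shell
# Emit a lifecycle configuration matching the two rules listed above.
# Assumptions: rule ID "musehub-cold-logs" and prefix "musehub/app/".
lifecycle_json() {
  cat <<'EOF'
{
  "Rules": [
    {
      "ID": "musehub-cold-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "musehub/app/" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF
}

# Write the configuration to a file and apply it to the cold-log bucket.
apply_lifecycle() {
  lifecycle_json > /tmp/musehub-lifecycle.json
  aws s3api put-bucket-lifecycle-configuration \
    --bucket musehub-logs-cold \
    --lifecycle-configuration file:///tmp/musehub-lifecycle.json
}

# Usage:
#   apply_lifecycle
```

Note that `put-bucket-lifecycle-configuration` replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the JSON.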

Common incidents

High 5xx rate

Check recent deployments: muse -C ~/musehub log --oneline -20
Find 5xx requests: filter logs for $.status >= 500
Inspect the exc_info field for stack traces
If DB-related: check the connections alarm and run select count(*) from pg_stat_activity
Rollback: bash deploy/deploy.sh deploys whichever image is already built on the host, so confirm that the previous image tag is still present before rolling back
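
The rollback precondition can be checked before running the deploy script. The sketch below is hedged: how deploy/deploy.sh selects its image is not documented in this runbook, and the image reference `musehub:previous` is purely hypothetical.

```shell
# Succeed if the given image reference exists in the local Docker image store.
image_present() {
  docker image inspect "$1" >/dev/null 2>&1
}

# Report whether a rollback target is available on this host.
#   $1 is the image reference to roll back to (hypothetical example: musehub:previous)
preflight_rollback() {
  if image_present "$1"; then
    echo "ok: $1 is present"
  else
    echo "image $1 not found on this host; do not roll back yet" >&2
    return 1
  fi
}

# Usage:
#   preflight_rollback musehub:previous && bash deploy/deploy.sh
```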

High p99 latency

Filter logs for $.duration_ms > 2000 to find slow endpoints
Check slow query log (threshold: 100ms, configured in Settings.slow_query_threshold_ms)
Check RSS: docker stats musehub --no-stream
Check DB connections: select count(*), state from pg_stat_activity group by state

Disk > 80%

Identify largest directories: du -sh /data/musehub/objects/*/ | sort -rh | head -20
Check for objects past soft-delete window: select count(*) from musehub_objects where deleted_at < now() - interval '30 days'
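Run the hard-delete job manually if needed
If disk keeps growing: expand the EBS volume via the AWS Console, then run sudo growpart to extend the partition and grow the filesystem (resize2fs for ext4, xfs_growfs for XFS)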
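
After the EBS volume has been enlarged, the partition and then the filesystem both have to grow before the extra space is usable. The helper below only prints the filesystem command to run; the device name `/dev/nvme0n1p1` and the mount point `/data` are assumptions and should be confirmed with `lsblk -f` on the host.

```shell
# Print the filesystem-resize command that should follow `growpart`.
#   $1 filesystem type (as reported by `lsblk -f` or `df -T`)
#   $2 partition device, e.g. /dev/nvme0n1p1   (assumed name)
#   $3 mount point, e.g. /data                 (assumed path)
resize_cmd() {
  case "$1" in
    ext2|ext3|ext4) echo "sudo resize2fs $2" ;;
    xfs)            echo "sudo xfs_growfs -d $3" ;;
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Usage (device names are examples only):
#   lsblk -f                                  # find the device, partition and filesystem type
#   sudo growpart /dev/nvme0n1 1              # extend partition 1 to fill the volume
#   resize_cmd ext4 /dev/nvme0n1p1 /data      # prints the command to grow the filesystem
```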

DB connections > 90%

select count(*), state, wait_event_type from pg_stat_activity group by state, wait_event_type
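Kill idle connections: select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and state_change < now() - interval '5 minutes' (state_change records when the session became idle; query_start only records when its last query began)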
Restart the musehub container to flush its async pool: sudo docker restart musehub-blue (or -green)
If recurrent: decrease db_pool_timeout or add PgBouncer in front of PostgreSQL
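
Before terminating anything, it may be safer to list the sessions that would be affected. A minimal sketch, assuming a `DATABASE_URL` environment variable that holds a libpq connection string (this variable is not defined anywhere in the runbook) and using `state_change` as the time each session became idle:

```shell
# List idle sessions older than N minutes (default 5) without terminating them.
#   $1 optional age threshold in minutes
# Assumption: DATABASE_URL is set to a valid PostgreSQL connection string.
idle_sessions() {
  psql "${DATABASE_URL:?set DATABASE_URL first}" -X -A -t -c "
    select pid, usename, application_name, state_change
    from pg_stat_activity
    where state = 'idle'
      and state_change < now() - interval '${1:-5} minutes'
    order by state_change"
}

# Usage:
#   idle_sessions        # idle for more than 5 minutes
#   idle_sessions 30     # idle for more than 30 minutes
```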
