
MuseHub On-Call Runbook

On-call contact: gabriel
Alert channel: SNS musehub-alerts → SMS to gabriel's phone + email
Alert script: deploy/cloudwatch-alerts.sh


Alert inventory

| Alarm name | Threshold | Action |
| --- | --- | --- |
| musehub-5xx-rate-high | 5xx rate > 1% over 2 consecutive 1-min windows | Investigate app logs + DB |
| musehub-p99-latency-high | p99 latency > 2 s in 2 of 3 1-min windows | Check slow queries, CPU, memory |
| musehub-disk-high | Disk > 80% | Prune object store; archive or expand volume |
| musehub-db-connections-high | DB connections > 90% of max_connections | Restart connection pool; scale PgBouncer |
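
The "2 consecutive" and "2 of 3" window rules above correspond to the `--period`, `--evaluation-periods`, and `--datapoints-to-alarm` flags of `aws cloudwatch put-metric-alarm`. The sketch below only prints those flags; `alarm_window_flags` is a hypothetical helper, not taken from deploy/cloudwatch-alerts.sh, and the metric names and thresholds that script uses are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the put-metric-alarm window flags for an "M of N" rule.
# Args: breaching datapoints (M), windows evaluated (N), window length in seconds (default 60).
alarm_window_flags() {
  local breaching="$1" windows="$2" period="${3:-60}"
  printf -- '--period %s --evaluation-periods %s --datapoints-to-alarm %s\n' \
    "$period" "$windows" "$breaching"
}

# 5xx rule: 2 consecutive 1-min windows, i.e. 2 of 2.
alarm_window_flags 2 2
# p99 rule: 2 of 3 1-min windows.
alarm_window_flags 2 3
```

With "M of N" evaluation, a single slow minute does not page anyone, but two slow minutes within any three-minute span do.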

Setting up SMS alerting

# Provision alerts (run once; re-runnable)
ALERT_EMAIL="[email protected]" \
ALERT_PHONE="+1XXXXXXXXXX" \
AWS_REGION=eu-west-1 \
INSTANCE_ID=i-XXXXXXXXXXXXXXXXX \
bash deploy/cloudwatch-alerts.sh

AWS will send a confirmation email to ALERT_EMAIL — click the link to activate.
SMS subscription is immediate (no confirmation required for US numbers).
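
To check whether the email subscription has been confirmed, one option is to list the topic's subscriptions and look for entries whose ARN is still the literal string `PendingConfirmation`. This is a minimal sketch: `pending_endpoints` and `list_subscriptions` are hypothetical helper names, and the sample rows are placeholders, not real subscribers.

```shell
#!/usr/bin/env bash
# Reads "protocol<TAB>endpoint<TAB>subscription-arn" lines on stdin and prints
# the endpoints that are still waiting for confirmation.
pending_endpoints() {
  awk -F '\t' '$3 == "PendingConfirmation" { print $2 }'
}

# Lists subscriptions for a topic ARN as tab-separated text.
# Needs AWS credentials, so it is defined here but not called.
list_subscriptions() {
  aws sns list-subscriptions-by-topic \
    --topic-arn "$1" \
    --query 'Subscriptions[].[Protocol,Endpoint,SubscriptionArn]' \
    --output text \
    --region eu-west-1
}

# Offline demonstration with placeholder rows.
printf 'email\tops@example.com\tPendingConfirmation\nsms\t+15550100\tarn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts:abc\n' \
  | pending_endpoints
```

Against the real topic this would be run as `list_subscriptions arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts | pending_endpoints`; empty output means nothing is pending.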


PagerDuty integration (optional escalation)

To escalate automatically when gabriel has not responded within 15 minutes, set up PagerDuty once:

  1. Create a PagerDuty service with an "Amazon SNS" integration.
  2. Copy the PagerDuty HTTPS endpoint.
  3. Subscribe it to the SNS topic:
    aws sns subscribe \
      --topic-arn arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts \
      --protocol https \
      --notification-endpoint https://events.pagerduty.com/integration/XXXXXX/enqueue
    
  4. Set an escalation policy: page gabriel's mobile → page backup → open incident.

Log access

Logs ship to CloudWatch Logs group /musehub/app (30-day hot retention).

# Tail live logs
aws logs tail /musehub/app --follow --region eu-west-1

# Search for 5xx errors in the last hour
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '{ $.status >= 500 }' \
  --region eu-west-1

# Search by request_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.request_id = "REQUEST_ID_HERE" }' \
  --region eu-west-1

# Search by user_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.user_id = "gabriel" }' \
  --region eu-west-1
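
The `date -d '1 hour ago'` form used above is specific to GNU date. If these commands need to run from a macOS or BSD laptop, a small helper based on shell arithmetic is one alternative. `ms_ago` is a hypothetical name introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: epoch milliseconds for "N seconds ago".
# Uses shell arithmetic on `date +%s`, which avoids the GNU-only -d flag.
ms_ago() {
  local seconds="${1:?usage: ms_ago SECONDS}"
  echo $(( ($(date +%s) - seconds) * 1000 ))
}

# Example: print the start of a one-hour window in milliseconds.
ms_ago 3600
```

It can be dropped into the queries above as `--start-time "$(ms_ago 3600)"`.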

Cold log storage (1-year retention)

Hot logs (30 days) live in CloudWatch Logs. Cold logs are exported to S3:

  • Export job (run monthly or set up Kinesis Firehose subscription):

    aws logs create-export-task \
      --log-group-name /musehub/app \
      --from $(date -d '31 days ago' +%s)000 \
      --to $(date -d '1 day ago' +%s)000 \
      --destination musehub-logs-cold \
      --destination-prefix musehub/app \
      --region eu-west-1
    
  • S3 lifecycle policy on musehub-logs-cold bucket:

    • Transition to Glacier after 30 days
    • Expire (delete) after 365 days
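
The two lifecycle rules above could be expressed as a lifecycle configuration document and applied with `aws s3api put-bucket-lifecycle-configuration`. This is a sketch under assumptions: the rule ID `musehub-cold-logs` is made up, and the `musehub/app/` prefix filter is inferred from the `--destination-prefix` in the export command rather than from an existing policy.

```shell
#!/usr/bin/env bash
# Writes a lifecycle document with the two rules from the runbook to the given path.
write_lifecycle_json() {
  local out="${1:?usage: write_lifecycle_json FILE}"
  cat > "$out" <<'JSON'
{
  "Rules": [
    {
      "ID": "musehub-cold-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "musehub/app/" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
JSON
}

# Applies the document to the bucket.
# Needs AWS credentials, so it is defined here but not called.
apply_lifecycle() {
  aws s3api put-bucket-lifecycle-configuration \
    --bucket musehub-logs-cold \
    --lifecycle-configuration "file://$1"
}

write_lifecycle_json "${TMPDIR:-/tmp}/musehub-lifecycle.json"
```

After reviewing the generated file, `apply_lifecycle "${TMPDIR:-/tmp}/musehub-lifecycle.json"` would push it to the bucket. Note that this call replaces any lifecycle configuration already on the bucket.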

Common incidents

High 5xx rate

  1. Check recent deployments: muse -C ~/musehub log --oneline -20
  2. Find 5xx requests: filter logs for $.status >= 500
  3. Look at exc_info field for stack traces
  4. If DB-related: check connections alarm + select count(*) from pg_stat_activity
  5. Rollback: bash deploy/deploy.sh deploys an image that is already built, so confirm the previous image tag is still present on the host before rolling back

High p99 latency

  1. Filter logs for $.duration_ms > 2000 to find slow endpoints
  2. Check slow query log (threshold: 100ms, configured in Settings.slow_query_threshold_ms)
  3. Check RSS: docker stats musehub --no-stream
  4. Check DB connections: select count(*), state from pg_stat_activity group by state

Disk > 80%

  1. Identify largest directories: du -sh /data/musehub/objects/*/ | sort -rh | head -20
  2. Check for objects past soft-delete window: select count(*) from musehub_objects where deleted_at < now() - interval '30 days'
  3. Run hard-delete job manually if needed
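  4. If disk keeps growing: expand the EBS volume via AWS Console, then run sudo growpart to extend the partition and grow the filesystem (sudo resize2fs for ext4, sudo xfs_growfs for XFS)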
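
Before and after pruning, it can help to confirm the usage figure the alarm is reacting to. The sketch below reads it from `df -P`; the function names are hypothetical, and the demo checks `/` so it runs anywhere, whereas on the host the argument would be the volume that holds /data/musehub.

```shell
#!/usr/bin/env bash
# Prints the used-space percentage (integer, no % sign) of the filesystem holding the path.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage is strictly above the limit (default 80, matching the alarm).
disk_over_threshold() {
  local pct="$1" limit="${2:-80}"
  [ "$pct" -gt "$limit" ]
}

pct="$(disk_used_pct /)"
if disk_over_threshold "$pct"; then
  echo "disk at ${pct}%: above threshold"
else
  echo "disk at ${pct}%: ok"
fi
```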

DB connections > 90%

  1. select count(*), state, wait_event_type from pg_stat_activity group by state, wait_event_type
  2. Kill connections idle for more than 5 minutes: select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and state_change < now() - interval '5 minutes' and pid <> pg_backend_pid()
  3. Restart the musehub container to flush its async pool: sudo docker restart musehub-blue (or -green)
  4. If recurrent: decrease db_pool_timeout or add PgBouncer in front of PostgreSQL
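
To see how close the database is to the 90% alarm threshold, the connection count can be compared with `max_connections`. This is a minimal sketch: both function names are hypothetical, `db_conn_pct` assumes `psql` can connect through the usual PG* environment variables, and the `backend_type` filter assumes PostgreSQL 10 or newer.

```shell
#!/usr/bin/env bash
# Integer percentage of max_connections in use (rounded down).
conn_pct() {
  local used="$1" max="$2"
  echo $(( used * 100 / max ))
}

# Reads both numbers from PostgreSQL and prints the percentage.
# Needs database access, so it is defined here but not called.
db_conn_pct() {
  local used max
  used="$(psql -Atc "select count(*) from pg_stat_activity where backend_type = 'client backend'")"
  max="$(psql -Atc 'show max_connections')"
  conn_pct "$used" "$max"
}

# Offline example: 91 connections out of 100 prints 91.
conn_pct 91 100
```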