# MuseHub On-Call Runbook

**On-call contact:** gabriel

**Alert channel:** SNS `musehub-alerts` → SMS to gabriel's phone + email

**Alert script:** `deploy/cloudwatch-alerts.sh`

---

## Alert inventory

| Alarm name | Threshold | Action |
|---|---|---|
| `musehub-5xx-rate-high` | 5xx rate > 1% over 2 consecutive 1-min windows | Investigate app logs + DB |
| `musehub-p99-latency-high` | p99 latency > 2 s in 2 of 3 1-min windows | Check slow queries, CPU, memory |
| `musehub-disk-high` | Disk > 80% | Prune object store; archive or expand volume |
| `musehub-db-connections-high` | DB connections > 90% of `max_connections` | Restart connection pool; scale PgBouncer |

---

## Setting up SMS alerting

```bash
# Provision alerts (run once; re-runnable)
ALERT_EMAIL="gabriel@musehub.ai" \
ALERT_PHONE="+1XXXXXXXXXX" \
AWS_REGION=eu-west-1 \
INSTANCE_ID=i-XXXXXXXXXXXXXXXXX \
  bash deploy/cloudwatch-alerts.sh
```

AWS will send a confirmation email to `ALERT_EMAIL`; click the link to activate.
SMS subscription is immediate (no confirmation required for US numbers).

---

## PagerDuty integration (optional escalation)

If gabriel is unreachable after 15 minutes:

1. Create a PagerDuty service with an "Amazon SNS" integration.
2. Copy the PagerDuty HTTPS endpoint.
3. Subscribe it to the SNS topic:
   ```bash
   aws sns subscribe \
     --topic-arn arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts \
     --protocol https \
     --notification-endpoint https://events.pagerduty.com/integration/XXXXXX/enqueue
   ```
4. Set an escalation policy: page gabriel's mobile → page backup → open incident.

---

## Log access

Logs ship to CloudWatch Logs group `/musehub/app` (30-day hot retention).
```bash
# Tail live logs
aws logs tail /musehub/app --follow --region eu-west-1

# Search for 5xx errors in the last hour
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '{ $.status >= 500 }' \
  --region eu-west-1

# Search by request_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.request_id = "REQUEST_ID_HERE" }' \
  --region eu-west-1

# Search by user_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.user_id = "gabriel" }' \
  --region eu-west-1
```

---

## Cold log storage (1-year retention)

Hot logs (30 days) live in CloudWatch Logs. Cold logs are exported to S3:

- **Export job** (run monthly or set up Kinesis Firehose subscription):
  ```bash
  aws logs create-export-task \
    --log-group-name /musehub/app \
    --from $(date -d '31 days ago' +%s)000 \
    --to $(date -d '1 day ago' +%s)000 \
    --destination musehub-logs-cold \
    --destination-prefix musehub/app \
    --region eu-west-1
  ```
- **S3 lifecycle policy** on `musehub-logs-cold` bucket:
  - Transition to Glacier after 30 days
  - Expire (delete) after 365 days

---

## Common incidents

### High 5xx rate

1. Check recent deployments: `muse -C ~/musehub log --oneline -20`
2. Find 5xx requests: filter logs for `$.status >= 500`
3. Look at the `exc_info` field for stack traces
4. If DB-related: check the connections alarm + `select count(*) from pg_stat_activity`
5. Rollback: `bash deploy/deploy.sh` deploys the image that is already built, so ensure the previous image tag is still present on the host before rolling back

### High p99 latency

1. Filter logs for `$.duration_ms > 2000` to find slow endpoints
2. Check the slow query log (threshold: 100ms, configured in `Settings.slow_query_threshold_ms`)
3. Check RSS: `docker stats musehub --no-stream`
4. Check DB connections: `select count(*), state from pg_stat_activity group by state`

### Disk > 80%

1. Identify the largest directories: `du -sh /data/musehub/objects/*/ | sort -rh | head -20`
2. Check for objects past the soft-delete window: `select count(*) from musehub_objects where deleted_at < now() - interval '30 days'`
3. Run the hard-delete job manually if needed
4. If disk keeps growing: expand the EBS volume via AWS Console + `sudo growpart`, then grow the filesystem to fill the partition (`resize2fs` or `xfs_growfs`, depending on the filesystem)

### DB connections > 90%

1. `select count(*), state, wait_event_type from pg_stat_activity group by state, wait_event_type`
2. Kill idle connections: `select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and query_start < now() - interval '5 minutes'`
3. Restart the musehub container to flush its async pool: `sudo docker restart musehub-blue` (or `-green`)
4. If recurrent: decrease `db_pool_timeout` or add PgBouncer in front of PostgreSQL
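
When working through the DB connections steps, it can help to express current usage as a percentage of `max_connections`, since that is the ratio the alarm is defined on. The following is a minimal sketch and not part of the MuseHub repo: `conn_pct` is a hypothetical helper, and the commented `psql` wiring assumes the database is reachable through the standard `PG*` environment variables.

```bash
#!/usr/bin/env bash
# Sketch: report connection usage as an integer percentage of max_connections.
# conn_pct is a hypothetical helper introduced here for illustration.

# Print used/max as a whole-number percentage (rounded down).
conn_pct() {
  local used="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $(( used * 100 / max ))
}

# Example wiring (assumption: psql connects via PGHOST/PGUSER/PGDATABASE):
#   used=$(psql -Atc "select count(*) from pg_stat_activity")
#   max=$(psql -Atc "show max_connections")
#   echo "connections: $(conn_pct "$used" "$max")% of max_connections"
```

A result above 90 matches the `musehub-db-connections-high` threshold; rerunning it after killing idle connections or restarting the container shows whether the step actually freed capacity.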