# MuseHub On-Call Runbook

**On-call contact:** gabriel  
**Alert channel:** SNS `musehub-alerts` → SMS to gabriel's phone + email  
**Alert script:** `deploy/cloudwatch-alerts.sh`

---

## Alert inventory

| Alarm name | Threshold | Action |
|---|---|---|
| `musehub-5xx-rate-high` | 5xx rate > 1% over 2 consecutive 1-min windows | Investigate app logs + DB |
| `musehub-p99-latency-high` | p99 latency > 2 s in 2 of 3 consecutive 1-min windows | Check slow queries, CPU, memory |
| `musehub-disk-high` | Disk > 80% | Prune object store; archive or expand volume |
| `musehub-db-connections-high` | DB connections > 90% of `max_connections` | Restart connection pool; scale PgBouncer |

---

## Setting up SMS alerting

```bash
# Provision alerts (run once; re-runnable)
ALERT_EMAIL="gabriel@musehub.ai" \
ALERT_PHONE="+1XXXXXXXXXX" \
AWS_REGION=eu-west-1 \
INSTANCE_ID=i-XXXXXXXXXXXXXXXXX \
bash deploy/cloudwatch-alerts.sh
```

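AWS sends a confirmation email to `ALERT_EMAIL`; click the link to activate the subscription.  
SMS subscriptions take effect immediately (no confirmation required for US numbers). If the AWS account is still in the SNS SMS sandbox, the phone number must be verified first.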
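
To check that both subscriptions are active after provisioning, the topic's subscriptions can be listed. This is a minimal sketch: the function names are hypothetical, and the topic ARN is passed in by the caller because this runbook does not state the account ID.

```bash
# Hypothetical helpers; pass the full topic ARN, e.g.
# arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts

# List protocol, endpoint and subscription ARN for every subscription on the topic.
list_alert_subscriptions() {
  local topic_arn="$1"
  aws sns list-subscriptions-by-topic \
    --topic-arn "$topic_arn" \
    --region eu-west-1 \
    --query 'Subscriptions[].[Protocol,Endpoint,SubscriptionArn]' \
    --output text
}

# Print endpoints that are still unconfirmed. SNS reports the literal string
# "PendingConfirmation" in place of the subscription ARN for these.
pending_alert_subscriptions() {
  list_alert_subscriptions "$1" | awk '$3 == "PendingConfirmation" { print $2 }'
}
```

An empty result from `pending_alert_subscriptions` suggests every subscription on the topic has been confirmed.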

---

## PagerDuty integration (optional escalation)

To escalate when gabriel has not responded within 15 minutes, route alerts through PagerDuty:

1. Create a PagerDuty service with an "Amazon SNS" integration.
2. Copy the PagerDuty HTTPS endpoint.
3. Subscribe it to the SNS topic:
   ```bash
   aws sns subscribe \
     --topic-arn arn:aws:sns:eu-west-1:ACCOUNT:musehub-alerts \
     --protocol https \
     --notification-endpoint https://events.pagerduty.com/integration/XXXXXX/enqueue
   ```
4. Set an escalation policy: page gabriel's mobile → page backup → open incident.

---

## Log access

Logs ship to CloudWatch Logs group `/musehub/app` (30-day hot retention).

```bash
# Tail live logs
aws logs tail /musehub/app --follow --region eu-west-1

# Search for 5xx errors in the last hour (GNU date; --start-time takes epoch milliseconds)
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '{ $.status >= 500 }' \
  --region eu-west-1

# Search by request_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.request_id = "REQUEST_ID_HERE" }' \
  --region eu-west-1

# Search by user_id
aws logs filter-log-events \
  --log-group-name /musehub/app \
  --filter-pattern '{ $.user_id = "gabriel" }' \
  --region eu-west-1
```
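
The searches above differ only in their filter pattern and time window, so they can be wrapped in a single helper. This is a sketch assuming bash; `epoch_ms_ago` and `search_logs` are hypothetical names, not part of the repo. It computes the timestamp with shell arithmetic, which avoids the GNU-only `date -d`.

```bash
# Print the epoch-millisecond timestamp for N minutes before now.
epoch_ms_ago() {
  local minutes="$1" now_s
  now_s="$(date +%s)"
  echo $(( (now_s - minutes * 60) * 1000 ))
}

# search_logs PATTERN [MINUTES]
# Run a CloudWatch Logs filter pattern over the last MINUTES (default 60).
search_logs() {
  local pattern="$1" minutes="${2:-60}"
  aws logs filter-log-events \
    --log-group-name /musehub/app \
    --start-time "$(epoch_ms_ago "$minutes")" \
    --filter-pattern "$pattern" \
    --region eu-west-1
}
```

For example, `search_logs '{ $.status >= 500 }' 15` would look for 5xx responses in the last 15 minutes.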

---

## Cold log storage (1-year retention)

Hot logs (30 days) live in CloudWatch Logs. Cold logs are exported to S3:

- **Export job** (run monthly, or set up a Kinesis Data Firehose subscription instead):
  ```bash
  aws logs create-export-task \
    --log-group-name /musehub/app \
    --from $(date -d '31 days ago' +%s)000 \
    --to $(date -d '1 day ago' +%s)000 \
    --destination musehub-logs-cold \
    --destination-prefix musehub/app \
    --region eu-west-1
  ```

- **S3 lifecycle policy** on `musehub-logs-cold` bucket:
  - Transition to Glacier after 30 days
  - Expire (delete) after 365 days
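
The lifecycle policy described above could be applied from the command line roughly as follows. This is a sketch under assumptions: the rule ID `musehub-cold-logs` and both function names are hypothetical, and the empty prefix applies the rule to every object in the bucket.

```bash
# Print a lifecycle configuration: Glacier after 30 days, delete after 365 days.
lifecycle_policy_json() {
  cat <<'JSON'
{
  "Rules": [
    {
      "ID": "musehub-cold-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
JSON
}

# Write the policy to a temp file and attach it to the cold-log bucket.
apply_lifecycle_policy() {
  local tmp
  tmp="$(mktemp)"
  lifecycle_policy_json > "$tmp"
  aws s3api put-bucket-lifecycle-configuration \
    --bucket musehub-logs-cold \
    --lifecycle-configuration "file://$tmp"
  rm -f "$tmp"
}
```

Note that `put-bucket-lifecycle-configuration` replaces any lifecycle configuration already on the bucket, so existing rules should be reviewed first.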

---

## Common incidents

### High 5xx rate

1. Check recent deployments: `muse -C ~/musehub log --oneline -20`
2. Find 5xx requests: filter logs with `{ $.status >= 500 }`
3. Inspect the `exc_info` field for stack traces
4. If DB-related: check the `musehub-db-connections-high` alarm and run `select count(*) from pg_stat_activity`
5. Rollback: `bash deploy/deploy.sh` deploys whichever image is already built on the host, so confirm the previous image tag is still present before rolling back
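
To see when the errors started, and so whether they line up with a deployment, a per-minute count of 5xx responses can help. The sketch below uses CloudWatch Logs Insights and assumes only the `status` field already used in the filter patterns; the function names are hypothetical.

```bash
# Start a Logs Insights query counting 5xx responses per minute over the
# last N minutes (default 60) and print the query ID.
# Unlike filter-log-events, start-query takes epoch *seconds*.
start_5xx_timeline() {
  local minutes="${1:-60}" end_s start_s
  end_s="$(date +%s)"
  start_s=$(( end_s - minutes * 60 ))
  aws logs start-query \
    --log-group-name /musehub/app \
    --start-time "$start_s" \
    --end-time "$end_s" \
    --query-string 'filter status >= 500 | stats count(*) as errors by bin(1m)' \
    --region eu-west-1 \
    --query queryId \
    --output text
}

# Fetch the results for a query ID. The response carries a "status" field;
# re-run until it reports Complete.
get_5xx_timeline() {
  aws logs get-query-results --query-id "$1" --region eu-west-1
}
```

Typical use would be `qid="$(start_5xx_timeline 30)"` followed by `get_5xx_timeline "$qid"` a few seconds later.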

### High p99 latency

1. Filter logs for `$.duration_ms > 2000` to find slow endpoints
2. Check slow query log (threshold: 100ms, configured in `Settings.slow_query_threshold_ms`)
3. Check container memory and CPU: `docker stats --no-stream musehub-blue` (or `musehub-green`)
4. Check DB connections: `select count(*), state from pg_stat_activity group by state`

### Disk > 80%

1. Identify largest directories: `du -sh /data/musehub/objects/*/ | sort -rh | head -20`
2. Check for objects past soft-delete window: `select count(*) from musehub_objects where deleted_at < now() - interval '30 days'`
3. Run hard-delete job manually if needed
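4. If disk keeps growing: expand the EBS volume in the AWS Console, then grow the partition with `sudo growpart` and the filesystem with `sudo resize2fs` (ext4) or `sudo xfs_growfs` (XFS)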
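
Before and after pruning, usage can be compared against the 80% alarm threshold directly on the host. This is a small sketch with hypothetical function names; it assumes the object store path `/data/musehub/objects` from step 1 is on the volume being monitored.

```bash
# Print the used-space percentage (integer, no % sign) of the filesystem
# holding PATH. df -P forces the single-line POSIX output format.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is above THRESHOLD (default 80, matching the alarm).
disk_over_threshold() {
  local path="$1" threshold="${2:-80}"
  [ "$(disk_used_pct "$path")" -gt "$threshold" ]
}
```

For example, `disk_over_threshold /data/musehub/objects && echo "still above 80%"`.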

### DB connections > 90%

1. `select count(*), state, wait_event_type from pg_stat_activity group by state, wait_event_type`
2. Kill connections idle for more than 5 minutes: `select pg_terminate_backend(pid) from pg_stat_activity where state = 'idle' and state_change < now() - interval '5 minutes' and pid <> pg_backend_pid()`
3. Restart the musehub container to flush its async pool: `sudo docker restart musehub-blue` (or `-green`)
4. If recurrent: decrease `db_pool_timeout` or add PgBouncer in front of PostgreSQL
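
To compare current usage with the 90% alarm threshold, the connection count can be expressed as a percentage of `max_connections`. This is a sketch with hypothetical function names; it assumes `psql` picks up its connection settings from the standard `PG*` environment variables.

```bash
# Integer percentage of max_connections in use (rounded down).
conn_usage_pct() {
  local used="$1" max="$2"
  echo $(( used * 100 / max ))
}

# Ask PostgreSQL for the current count and the configured limit.
# -A (unaligned) and -t (tuples only) make psql print bare values.
db_conn_usage_pct() {
  local used max
  used="$(psql -At -c 'select count(*) from pg_stat_activity')"
  max="$(psql -At -c 'show max_connections')"
  conn_usage_pct "$used" "$max"
}
```

`pg_stat_activity` also lists background processes on recent PostgreSQL versions, so the figure may slightly overstate client connections.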