Secret Rotation Runbook

Last updated: 2026-04-05

Overview

All secrets are stored in AWS SSM Parameter Store (/musehub/<env>/) as SecureString values (AES-256, KMS-encrypted). deploy/secrets.sh fetches them at deploy time and writes /opt/musehub/.env.

| Secret | Location | Rotation schedule | Impact of compromise |
|---|---|---|---|
| `DB_PASSWORD` | SSM `/musehub/<env>/DB_PASSWORD` | Every 180 days | Full DB read/write access |
| `WEBHOOK_SECRET_KEY` | SSM `/musehub/<env>/WEBHOOK_SECRET_KEY` | On compromise | Webhook HMAC spoofing |
| `RUNNER_TOKEN` | SSM `/musehub/<env>/RUNNER_TOKEN` | Every 90 days | CI job injection |
| `R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY` | SSM `/musehub/<env>/R2_*` | Every 90 days | Object store read/write |
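
To check the schedules in the table against what is actually stored, one option is to compare each parameter's last-modified time with its rotation interval. The sketch below is only an illustration and is not part of the deploy scripts: `rotation_due` is a hypothetical helper name, and the commented usage assumes GNU `date` and an AWS CLI that prints `LastModifiedDate` as an ISO 8601 timestamp.

```shell
# Hypothetical helper: exit 0 when a secret is at least as old as its interval.
#   $1 = last-modified time (epoch seconds)
#   $2 = rotation interval in days (90 or 180 in the table above)
#   $3 = current time (epoch seconds)
rotation_due() {
    age_days=$(( ($3 - $1) / 86400 ))
    [ "$age_days" -ge "$2" ]
}

# Example usage (assumes GNU date and ISO 8601 output from the AWS CLI):
# last=$(aws ssm get-parameter --name /musehub/production/DB_PASSWORD \
#     --region us-east-1 --query Parameter.LastModifiedDate --output text)
# rotation_due "$(date -d "$last" +%s)" 180 "$(date +%s)" && echo "DB_PASSWORD is due"
```

Passing the current time as an argument keeps the comparison itself free of clock and platform dependencies.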

1. DB_PASSWORD rotation

Schedule: Every 180 days (calendar reminder).

# 1. Generate new password
NEW_PW=$(openssl rand -hex 24)

# 2. Update Postgres. A role has only one password, so ALTER USER replaces it
#    immediately: open connections keep working, but new connections that use
#    the old password fail until step 4 re-deploys with the new .env.
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "sudo docker exec musehub_postgres psql -U musehub -d musehub \
   -c \"ALTER USER musehub PASSWORD '$NEW_PW';\""

# 3. Update SSM (both prod and staging if applicable)
aws ssm put-parameter \
    --name /musehub/production/DB_PASSWORD \
    --value "$NEW_PW" \
    --type SecureString \
    --overwrite \
    --region us-east-1

# 4. Re-deploy to pick up the new .env
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "cd /opt/musehub && bash deploy/secrets.sh && bash deploy/deploy.sh"

# 5. Verify the new containers connect successfully
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "sudo docker ps && curl -sf https://localhost:1337/explore > /dev/null && echo OK"

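Step 2 interpolates `$NEW_PW` into a SQL string through two layers of shell quoting, which is only safe because `openssl rand -hex` emits nothing but hex digits. If the password ever comes from another source, a guard like the following could run before step 2. This is a sketch; `is_hex_secret` is a hypothetical name, not an existing script.

```shell
# Hypothetical guard: succeed only if $1 is non-empty and every character
# is in the set 0-9a-f (the alphabet produced by `openssl rand -hex`).
is_hex_secret() {
    case "$1" in
        '' | *[!0-9a-f]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example usage before the ALTER USER step:
# is_hex_secret "$NEW_PW" || { echo "refusing: unexpected characters in password" >&2; exit 1; }
```
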
2. WEBHOOK_SECRET_KEY rotation

Schedule: Rotate immediately on any suspected compromise.

The key is a Fernet key (AES-128-CBC + HMAC-SHA256). Rotating it invalidates all existing webhook HMAC signatures — users will need to re-register their webhook endpoints after rotation.

# 1. Generate a new Fernet key
NEW_KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

# 2. Update SSM
aws ssm put-parameter \
    --name /musehub/production/WEBHOOK_SECRET_KEY \
    --value "$NEW_KEY" \
    --type SecureString \
    --overwrite \
    --region us-east-1

# 3. Re-deploy
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "cd /opt/musehub && bash deploy/secrets.sh && bash deploy/deploy.sh"

# 4. Notify affected users (webhook deliveries will fail until they re-register)
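
A Fernet key is 32 random bytes encoded as 44 characters of URL-safe base64, so a truncated or mis-pasted value can be caught before it is written to SSM in step 2. The check below is a sketch that assumes a `base64` binary accepting `-d` (as in GNU coreutils); `is_fernet_key` is a hypothetical helper name.

```shell
# Hypothetical check: succeed only if $1 looks like a Fernet key,
# i.e. 44 characters of URL-safe base64 that decode to exactly 32 bytes.
is_fernet_key() {
    [ "${#1}" -eq 44 ] || return 1
    # Map the URL-safe alphabet back to standard base64, decode, count bytes.
    decoded_len=$(printf '%s' "$1" | tr '_-' '/+' | base64 -d 2>/dev/null | wc -c | tr -d ' ')
    [ "$decoded_len" = 32 ]
}

# Example usage before step 2:
# is_fernet_key "$NEW_KEY" || { echo "NEW_KEY is not a valid Fernet key" >&2; exit 1; }
```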

3. RUNNER_TOKEN rotation

Schedule: Every 90 days.

The runner token is a shared secret between MuseHub and the musehub-runner container. Both must be updated atomically to avoid a brief 503 window.

# 1. Generate new token
NEW_TOKEN=$(openssl rand -hex 32)

# 2. Update SSM
aws ssm put-parameter \
    --name /musehub/production/RUNNER_TOKEN \
    --value "$NEW_TOKEN" \
    --type SecureString \
    --overwrite \
    --region us-east-1

# 3. Re-deploy (deploy.sh restarts both musehub and musehub-runner)
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "cd /opt/musehub && bash deploy/secrets.sh && bash deploy/deploy.sh"

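Because a token mismatch between the two services surfaces as 503s, it can be worth confirming that both containers came back after step 3. The sketch below assumes the containers are named `musehub` and `musehub-runner`; only the runner name appears in this runbook, so adjust to whatever `docker ps` actually reports. `containers_up` is a hypothetical helper.

```shell
# Hypothetical check: succeed only if every named container is listed
# by `docker ps` (which shows running containers only).
containers_up() {
    running=$(sudo docker ps --format '{{.Names}}')
    for c in "$@"; do
        printf '%s\n' "$running" | grep -qxF "$c" || {
            echo "not running: $c" >&2
            return 1
        }
    done
}

# Example usage on the host after deploy.sh finishes (names are assumptions):
# containers_up musehub musehub-runner && echo OK
```
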
4. R2 credentials rotation

Schedule: Every 90 days via Cloudflare dashboard.

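# 1. Create new R2 API token in Cloudflare dashboard
#    (R2 → Manage API tokens → Create token), then paste the two values it
#    shows; read -s keeps them off the screen and out of shell history
read -rs -p "R2 access key ID: " NEW_KEY_ID; echo
read -rs -p "R2 secret access key: " NEW_SECRET; echo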

# 2. Update SSM
aws ssm put-parameter \
    --name /musehub/production/R2_ACCESS_KEY_ID \
    --value "$NEW_KEY_ID" \
    --type SecureString \
    --overwrite \
    --region us-east-1

aws ssm put-parameter \
    --name /musehub/production/R2_SECRET_ACCESS_KEY \
    --value "$NEW_SECRET" \
    --type SecureString \
    --overwrite \
    --region us-east-1

# 3. Re-deploy
ssh -i ~/.ssh/musehub-key.pem [email protected] \
  "cd /opt/musehub && bash deploy/secrets.sh && bash deploy/deploy.sh"

# 4. Revoke old token in Cloudflare dashboard

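Sections 1 to 4 repeat the same `aws ssm put-parameter` call with a different name and value. A small wrapper, sketched below on the assumption that every secret lives under `/musehub/<env>/` in `us-east-1` as described in the overview, keeps the flags consistent. `put_secret` is a hypothetical name.

```shell
# Hypothetical wrapper around the put-parameter call used in sections 1-4.
#   $1 = environment (production or staging)
#   $2 = parameter name, e.g. RUNNER_TOKEN
#   $3 = new secret value
put_secret() {
    aws ssm put-parameter \
        --name "/musehub/$1/$2" \
        --value "$3" \
        --type SecureString \
        --overwrite \
        --region us-east-1
}

# Example usage:
# put_secret production R2_ACCESS_KEY_ID "$NEW_KEY_ID"
# put_secret production R2_SECRET_ACCESS_KEY "$NEW_SECRET"
```
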
5. Docker image layer audit

Run after every production build to confirm no secrets are baked in:

# Get the current image ID from docker compose (-q prints IDs only)
IMAGE=$(sudo docker compose images -q musehub | head -n 1)

# Audit every layer's creation command
sudo docker history --no-trunc "$IMAGE" | grep -iE \
  "password|secret|token|key|credential" \
&& echo "FAIL — secrets found in image layers" \
|| echo "OK — no secrets in image layers"

# Also check env vars baked into the image
sudo docker inspect "$IMAGE" | python3 -c "
import sys, json
for img in json.load(sys.stdin):
    for env in img.get('Config', {}).get('Env', []):
        key = env.split('=', 1)[0].upper()
        bad = any(w in key for w in ['PASSWORD','SECRET','TOKEN','KEY','CREDENTIAL'])
        if bad:
            print(f'WARN: {env.split(\"=\", 1)[0]!r} is set in image ENV')
"

Expected output: no FAIL or WARN lines. The image ENV should set only non-secret variables such as PYTHONPATH, PYTHONDONTWRITEBYTECODE, and PYTHONUNBUFFERED.
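
The `&& echo ... || echo ...` form in the layer audit always exits 0, so it cannot fail a build on its own. If the audit is wired into CI, a variant that returns a non-zero status on a match may be more useful. This is a sketch; `audit_history` is a hypothetical name, and it reads `docker history --no-trunc` output from stdin.

```shell
# Hypothetical CI-friendly audit: prints any matching layer commands and
# returns 1 when a secret-like word is found, 0 otherwise.
audit_history() {
    if grep -iE 'password|secret|token|key|credential'; then
        echo "FAIL: secrets found in image layers" >&2
        return 1
    fi
    echo "OK: no secrets in image layers"
}

# Example usage:
# sudo docker history --no-trunc "$IMAGE" | audit_history || exit 1
```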


6. First-time SSM bootstrap

Run once to populate SSM for a new environment:

ENV=production   # or staging

# DB password
aws ssm put-parameter \
    --name /musehub/$ENV/DB_PASSWORD \
    --value "$(openssl rand -hex 24)" \
    --type SecureString \
    --region us-east-1

# Webhook key
aws ssm put-parameter \
    --name /musehub/$ENV/WEBHOOK_SECRET_KEY \
    --value "$(python3 -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')" \
    --type SecureString \
    --region us-east-1

# Runner token
aws ssm put-parameter \
    --name /musehub/$ENV/RUNNER_TOKEN \
    --value "$(openssl rand -hex 32)" \
    --type SecureString \
    --region us-east-1
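
The bootstrap commands create three of the five parameters listed in the overview; the R2 credentials are issued by Cloudflare and are written with the section 4 commands. To confirm nothing is missing before the first deploy, a check along these lines may help. `missing_params` is a hypothetical helper, and the expected names are taken from the overview table.

```shell
# Hypothetical check: print each expected parameter that is absent.
#   $1    = environment (production or staging)
#   stdin = existing parameter names, one full path per line
missing_params() {
    existing=$(cat)
    for name in DB_PASSWORD WEBHOOK_SECRET_KEY RUNNER_TOKEN \
                R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY; do
        printf '%s\n' "$existing" | grep -qxF "/musehub/$1/$name" || echo "$name"
    done
}

# Example usage (text output is tab-separated, hence the tr):
# aws ssm get-parameters-by-path --path "/musehub/$ENV/" --region us-east-1 \
#     --query 'Parameters[].Name' --output text | tr '\t' '\n' | missing_params "$ENV"
```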

7. Compromise response

If any secret is suspected compromised:

  1. Rotate immediately using the relevant procedure above.
  2. Audit CloudTrail for unexpected ssm:GetParameter calls:
     AWS Console → CloudTrail → Event history → Filter by Event name: GetParameter
  3. Revoke old value in the issuing system (Cloudflare, Postgres, etc.).
  4. Review application logs for anomalous authenticated requests in the window between suspected exposure and rotation.
  5. Document the incident in the team's incident log.
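
The console lookup in step 2 can also be run from the CLI with `aws cloudtrail lookup-events`, which searches the last 90 days of management events. The wrapper below is a sketch; `ssm_reads_since` is a hypothetical name, and the region is assumed to match the `us-east-1` used elsewhere in this runbook.

```shell
# Hypothetical wrapper: list GetParameter events since a given time.
#   $1 = start time, e.g. 2026-04-01T00:00:00Z
ssm_reads_since() {
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=GetParameter \
        --start-time "$1" \
        --region us-east-1 \
        --query 'Events[].[EventTime,Username]' \
        --output text
}

# Example usage:
# ssm_reads_since 2026-04-01T00:00:00Z
```

If deploy/secrets.sh fetches parameters in bulk, the relevant event name may be GetParameters or GetParametersByPath instead, so it may be worth repeating the lookup with those values.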