MuseHub Cloud Infrastructure

Last updated: 2026-04-08

Overview

MuseHub runs on AWS EC2 (us-east-1) behind nginx with Let's Encrypt TLS. The application stack is Docker Compose: musehub (uvicorn) + postgres:16 + musehub-runner. No managed RDS, no ECS, no load balancer — intentionally minimal for this stage.

Two environments:

| Environment | Domain | Instance | Elastic IP | Deploy status |
|---|---|---|---|---|
| Production | `musehub.ai` | `i-0855d6efe7fa1a49d` (`musehub-prod`) | `98.89.99.211` | ⚠️ Not yet integrated: no IAM instance profile attached, SSM agent unreachable. `push.sh prod` will fail until the `musehub-ec2-ssm` IAM role is associated with the instance. |
| Staging | `staging.musehub.ai` | `i-07547cd20bee2dea5` (`musehub-staging`) | `23.22.27.39` | ✅ Active: blue/green deploys working via `push.sh staging` |

Shared AWS Resources

| Resource | Value |
|---|---|
| Region | `us-east-1` |
| AMI | `ami-0c7217cdde317cfec` (Ubuntu 22.04 LTS) |
| Instance type | `t3.small` |
| Security group | `sg-05815872537fcfe76` (`musehub-sg`) |
| ECR registry | `992382692655.dkr.ecr.us-east-1.amazonaws.com` |
| ECR repository | `musehub/musehub` |
| IAM deploy user | `musehub-infra` (ECR push + SSM send) |
| IAM instance role | `musehub-ec2-ssm` (ECR pull + SSM receive) |

Security group inbound rules:

TCP 443 — HTTPS (Cloudflare IPs only, IPv4 + IPv6)

Port 22 (SSH) and port 80 (HTTP) are not open. All remote access is via AWS SSM Session Manager, which requires the musehub-infra AWS credentials (default profile in ~/.aws/credentials).

Cloudflare SSL mode is Full (Strict): Cloudflare terminates TLS at the edge using a Cloudflare-issued cert, then connects to the origin on port 443, where nginx presents the Cloudflare Origin Certificate at /etc/ssl/cloudflare/origin.pem. Nginx never needs to listen on port 80.
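
One way to keep the port 443 rule limited to Cloudflare is to regenerate it from Cloudflare's published ranges (https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6). The sketch below is illustrative and is not taken from the provisioning scripts: the function name `allow_443_from` and the `AWS` override variable are hypothetical. It reads one CIDR per line from stdin and issues one `authorize-security-group-ingress` call per range.

```shell
# Hypothetical helper, not part of the deploy scripts.
# Set AWS=echo to print the calls instead of making them.
AWS="${AWS:-aws}"

# Usage: allow_443_from <security-group-id> < file-with-one-CIDR-per-line
allow_443_from() {
  sg="$1"
  # "|| [ -n ... ]" keeps the last line even without a trailing newline
  while read -r cidr || [ -n "$cidr" ]; do
    case "$cidr" in
      "") continue ;;
      *:*)  # IPv6 ranges must go through --ip-permissions
        $AWS ec2 authorize-security-group-ingress --region us-east-1 \
          --group-id "$sg" \
          --ip-permissions "IpProtocol=tcp,FromPort=443,ToPort=443,Ipv6Ranges=[{CidrIpv6=$cidr}]" \
          </dev/null
        ;;
      *)    # IPv4 ranges can use the short form
        $AWS ec2 authorize-security-group-ingress --region us-east-1 \
          --group-id "$sg" --protocol tcp --port 443 --cidr "$cidr" \
          </dev/null
        ;;
    esac
  done
}
```

A typical call would be `curl -fsSL https://www.cloudflare.com/ips-v4 | allow_443_from sg-05815872537fcfe76`. A rule that already exists makes the call fail with `InvalidPermission.Duplicate`, so a real script would need to tolerate that error.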

Production Environment

⚠️ Prod deploy not yet active. The instance has no IAM instance profile — the SSM agent cannot register, so push.sh prod fails with InvalidInstanceId. To fix: associate the musehub-ec2-ssm IAM role with i-0855d6efe7fa1a49d in the EC2 console (Actions → Security → Modify IAM Role), then verify with aws ssm describe-instance-information --filters Key=InstanceIds,Values=i-0855d6efe7fa1a49d.
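
The console step above also has a CLI equivalent. The sketch below assumes the instance profile carries the same name as the role (`musehub-ec2-ssm`), which is what the console creates by default but is not confirmed here. The function names and the `AWS` override variable are hypothetical.

```shell
# Hypothetical helpers. Set AWS=echo to print the calls instead of making them.
AWS="${AWS:-aws}"

# Attach an instance profile to an existing instance.
attach_profile() {   # $1 = instance ID, $2 = instance profile name
  $AWS ec2 associate-iam-instance-profile --region us-east-1 \
    --instance-id "$1" --iam-instance-profile "Name=$2"
}

# Print the SSM agent's PingStatus for one instance.
ssm_ping_status() {  # $1 = instance ID
  $AWS ssm describe-instance-information --region us-east-1 \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' --output text
}
```

After `attach_profile i-0855d6efe7fa1a49d musehub-ec2-ssm`, the agent may take a few minutes to register. `ssm_ping_status i-0855d6efe7fa1a49d` prints `Online` once it has, and `None` while the instance is still unknown to SSM.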

Instance

Instance ID : i-0855d6efe7fa1a49d
Name        : musehub-prod
Elastic IP  : 98.89.99.211
App dir     : /opt/musehub

Namecheap DNS (musehub.ai)

| Type | Host | Value | TTL |
|---|---|---|---|
| A Record | @ | 98.89.99.211 | Automatic |
| A Record | www | 98.89.99.211 | Automatic |

Stack

nginx (host, ports 80/443)
  └─ proxy_pass → 127.0.0.1:1337
       └─ musehub container (uvicorn, port 1337)
            └─ depends_on → postgres container (port 5432 internal)
  musehub-runner container (polls musehub API for CI jobs)

Volumes

| Volume | Contents |
|---|---|
| `musehub_data` | Object store: all pushed repo objects |
| `postgres_data` | PostgreSQL data directory |
| `runner_workspace` | CI job working directories |

Environment variables (.env on instance at /opt/musehub/.env)

DEBUG=false
DATABASE_URL=postgresql+asyncpg://musehub:<DB_PASSWORD>@postgres:5432/musehub
DB_PASSWORD=<generated at provision time>
CORS_ORIGINS=["https://musehub.ai", "https://www.musehub.ai"]
WEBHOOK_SECRET_KEY=<generated Fernet key>
MUSEHUB_ALLOWED_ORIGINS=["musehub.ai", "www.musehub.ai"]
RUNNER_TOKEN=<generated at provision time>

Nginx config

Final SSL config lives at /etc/nginx/sites-available/musehub on the instance. Reference copy: deploy/nginx-ssl.conf.

Key timeouts:

/push and /push/objects — 300 s (large repo push serialization)
Everything else — 60 s

SSL

Let's Encrypt via Certbot. Auto-renews via cron (certbot renew). Certificate lives at /etc/letsencrypt/live/musehub.ai/.

Instance access (SSM — no SSH)

# Open an interactive shell on the prod instance
aws ssm start-session --target i-0855d6efe7fa1a49d --region us-east-1

# Run a one-off command
aws ssm send-command \
  --instance-ids i-0855d6efe7fa1a49d \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["sudo docker ps"]' \
  --region us-east-1 \
  --query "Command.CommandId" --output text

Useful commands on the instance

Run via SSM (aws ssm start-session --target <instance-id> --region us-east-1). The active app slot is either musehub-blue (port 1337) or musehub-green (port 1338).

# Which slot is live?
cat /opt/musehub/.active-slot
cat /etc/nginx/musehub-active-port

# View running containers
sudo docker ps

# Tail live app logs (substitute blue/green as needed)
sudo docker logs -f musehub-blue
sudo docker logs -f musehub-green

# Quick health check
curl -s http://127.0.0.1:1337/healthz   # blue slot
curl -s http://127.0.0.1:1338/healthz   # green slot

# Run Alembic migrations manually (against the live DB)
# -f2- keeps the whole value even if the password contains "="
DB_PASSWORD=$(grep '^DB_PASSWORD=' /opt/musehub/.env | cut -d= -f2-)
sudo docker run --rm \
  --network musehub_musehub-internal \
  --env-file /opt/musehub/.env \
  -e "DATABASE_URL=postgresql+asyncpg://musehub:${DB_PASSWORD}@postgres:5432/musehub" \
  <ecr-image>:<tag> alembic upgrade head

# Postgres shell (postgres container started by docker compose for the DB)
sudo docker exec -it postgres psql -U musehub -d musehub

# View nginx status
sudo systemctl status nginx
sudo nginx -t
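
The slot-to-port mapping used in these commands (musehub-blue on 1337, musehub-green on 1338) can be wrapped in a small helper so the health check always targets the live slot. This is a sketch with hypothetical function names. It assumes /opt/musehub/.active-slot holds the bare word `blue` or `green`, which the source does not state.

```shell
# Map a slot name to its host port.
slot_port() {
  case "$1" in
    blue)  echo 1337 ;;
    green) echo 1338 ;;
    *)     echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# Health-check whichever slot the marker file names.
check_live_slot() {   # $1 = marker file (defaults to the path used above)
  slot_file="${1:-/opt/musehub/.active-slot}"
  slot=$(cat "$slot_file") || return 1
  port=$(slot_port "$slot") || return 1
  curl -fsS "http://127.0.0.1:${port}/healthz"
}
```

Run `check_live_slot` inside an SSM session. A non-zero exit means the marker file is missing, names an unknown slot, or the live container is not answering /healthz.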

Staging Environment

Purpose

Full production mirror with a separate DB, separate object store, and separate domain. Used for smoke tests before every prod deploy. Never exposed to users.

Instance (provisioned by aws-provision-staging.sh)

Instance ID : i-07547cd20bee2dea5
Name        : musehub-staging
Elastic IP  : 23.22.27.39
App dir     : /opt/musehub
Domain      : staging.musehub.ai

Namecheap DNS (musehub.ai, Advanced DNS tab)

| Type | Host | Value | TTL |
|---|---|---|---|
| A Record | staging | `23.22.27.39` | Automatic |

Provisioning (one-time, run locally)

# 1. Provision EC2 + EIP
chmod +x deploy/aws-provision-staging.sh
./deploy/aws-provision-staging.sh
# Note the instance ID and Elastic IP printed at the end.

# 2. Add staging.musehub.ai A record on Namecheap (see above).
#    Wait for propagation (~5 min with Automatic TTL):
watch -n 10 "dig staging.musehub.ai +short"

# 3. Bootstrap the instance (installs AWS CLI, verifies ECR access)
bash deploy/bootstrap-instance.sh staging

# 4. Run setup script on the instance via SSM
aws ssm send-command \
  --instance-ids <instance-id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["chmod +x /opt/musehub/deploy/setup-ec2-staging.sh && /opt/musehub/deploy/setup-ec2-staging.sh"]' \
  --region us-east-1

# 5. Do the first deploy
bash deploy/push.sh staging

Recovering a down staging instance (522 / Bad Gateway)

Symptom: staging.musehub.ai returns Cloudflare 522 or Bad Gateway.

Root cause pattern: The container stopped (either manually or after a reboot) and --restart unless-stopped did not fire because the container was in a stopped (not crashed) state when the instance last rebooted.

Fix — one SSM command, no polling:

CMD_ID=$(aws ssm send-command \
  --region us-east-1 \
  --instance-ids i-07547cd20bee2dea5 \
  --document-name "AWS-RunShellScript" \
  --parameters '{"commands":["sudo docker start musehub-blue musehub-worker 2>&1 && sudo musehub-set-slot blue && echo done"]}' \
  --query "Command.CommandId" --output text)
echo "Command sent: $CMD_ID"
# Wait ~20s then check once:
sleep 20 && aws ssm get-command-invocation \
  --region us-east-1 \
  --command-id "$CMD_ID" \
  --instance-id i-07547cd20bee2dea5 \
  --query "[Status,StandardOutputContent]" --output text

Check staging.musehub.ai in the browser — it should be back.

Critical rules when using SSM to recover staging:

Never reboot to fix SSM Pending. A reboot stops containers that were manually started — --restart unless-stopped only auto-starts containers that were running (not stopped) at reboot time. Rebooting to fix SSM will take the site down and require a manual docker start anyway.
Never poll SSM in a loop. The shell until/while sleep pattern freezes the terminal and masks whether the command succeeded. Send the command, wait a fixed interval, fetch once.
SSM Pending ≠ SSM broken. The agent can show Online but queue commands as Pending for 10–30 seconds after a fresh start. Wait before concluding SSM is broken.
InProgress means it will complete. If a command shows InProgress it is executing on the instance — do not cancel or resend. Check back in 30s.
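
The rules above amount to a small decision table over the `Status` field returned by `aws ssm get-command-invocation`. The sketch below encodes it. The function name `ssm_next_step` is hypothetical, and the wait times are the ones quoted in the rules.

```shell
# Translate an SSM invocation status into the next operator action.
ssm_next_step() {   # $1 = Status from get-command-invocation
  case "$1" in
    Pending|Delayed)
      echo "wait: queued, fetch again in 10-30s" ;;
    InProgress)
      echo "wait: executing, fetch again in ~30s" ;;
    Success)
      echo "done" ;;
    Cancelled|Cancelling|TimedOut|Failed)
      echo "investigate: read StandardErrorContent" ;;
    *)
      echo "unknown status: $1" ;;
  esac
}
```

For example, `ssm_next_step Pending` prints a wait instruction rather than suggesting a reboot or a resend, which matches the guidance to fetch once after a fixed interval.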

Ongoing code deploys to staging

# Standard — builds image locally, pushes to ECR, triggers blue-green on staging
bash deploy/push.sh staging

Publishing a new muse CLI release

The install.sh script (served at https://staging.musehub.ai/install.sh) downloads muse-{version}.tar.gz from /releases/. The version comes from musehub/protocol/version.py (MUSE_VERSION), which tracks the musehub package version.

To ship a new muse build:

# From ~/ecosystem/musehub — builds sdist, uploads to S3, SSMs to staging,
# cleans up old tarballs (keeps 3), and verifies the URL is live.
bash deploy/publish_muse_release.sh

What it does:

Builds muse-{version}.tar.gz from ~/ecosystem/muse
Uploads to s3://musehub-releases/muse-{version}.tar.gz
SSMs to staging to copy from S3 → /data/releases/ (Docker volume)
Deletes stale tarballs from S3 and the server (keeps the 3 newest)
Smoke-tests https://staging.musehub.ai/releases/muse-{version}.tar.gz
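
The "keeps the 3 newest" cleanup step can be sketched as below. How publish_muse_release.sh actually picks the newest tarballs is not shown in this document. This version assumes "newest" means most recent modification time, and the function name `stale_tarballs` is hypothetical.

```shell
# List release tarballs beyond the N newest (by mtime), newest first.
stale_tarballs() {   # $1 = directory, $2 = how many to keep (default 3)
  keep="${2:-3}"
  # ls -t sorts newest first; tail skips the first <keep> entries
  ls -1t "$1"/muse-*.tar.gz 2>/dev/null | tail -n +"$((keep + 1))"
}
```

On the server this could be used as `stale_tarballs /data/releases 3 | while read -r f; do rm -f "$f"; done`. Listing before deleting makes it easy to review what would be removed.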

Note: SSH is blocked on the instance (port 443 only). All server commands go through AWS SSM (musehub-infra IAM user). The staging instance (i-07547cd20bee2dea5) has the required IAM instance profile; no other instance does.

To test the install script end-to-end locally:

curl -fsSL https://staging.musehub.ai/install.sh | sh
# verify
~/.local/bin/muse --version
# cleanup
rm -rf ~/.local/share/muse/venv && rm -f ~/.local/bin/muse

Instance access (SSM — no SSH)

# Interactive shell on staging
aws ssm start-session --target i-07547cd20bee2dea5 --region us-east-1

Deployment Workflow

Deploys are image-based via ECR. No SSH, no rsync, no code on the instance after provisioning. All deploy commands run from the local ~/ecosystem/musehub directory.

Deploy pipeline overview

Local machine (push.sh):
  1. docker build (linux/amd64)
  2. docker save → tar, crane push → ECR (musehub/musehub:<tag>)
  3. aws ssm send-command → sync deploy.sh, then run it

Instance (deploy.sh via SSM):
  4. deploy.sh written from local copy (always current — never stale)
  5. aws ecr get-login-password | docker login
  6. docker pull <ecr>:<tag>
  7. docker run (migrations only, then exit)
  8. docker run -d (new slot — blue or green)
  9. curl /healthz until healthy
  10. nginx -s reload (zero-downtime flip)
  11. docker rm (old slot)

Key invariant: push.sh always writes the current local deploy.sh to the instance via SSM before running it. This means the instance's deploy.sh is always in sync with the local repo — there is no separate "sync the deploy scripts" step.
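
Steps 8 to 10 of the pipeline (start the idle slot, wait for /healthz, then reload nginx) can be sketched as follows. This is not the contents of deploy.sh: the function names, the retry defaults, and the `PROBE` override variable are assumptions made for illustration.

```shell
# PROBE is the command used to test a URL; override it to dry-run the loop.
PROBE="${PROBE:-curl -fsS -o /dev/null}"

# The slot that is not currently live is the one a deploy starts.
inactive_slot() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     return 1 ;;
  esac
}

# Probe /healthz up to <tries> times, <delay> seconds apart.
wait_healthy() {   # $1 = port, $2 = tries (default 30), $3 = delay (default 2)
  tries="${2:-30}"; delay="${3:-2}"; n=0
  while [ "$n" -lt "$tries" ]; do
    $PROBE "http://127.0.0.1:$1/healthz" && return 0
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}
```

Because nginx is reloaded only after the health check passes, a bounded loop like this one leaves the old slot serving traffic if the new container never becomes healthy.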

ECR Push — Use crane (not docker push)

docker push to ECR routes through Docker Desktop's VPNKit proxy (http.docker.internal:3128 / 192.168.65.1:3128 from inside the VM). After a local IP change or a Docker Desktop restart, the VPNKit proxy drops connections mid-upload on large layer pushes, producing broken pipe errors. The fix is crane — Google's container registry tool — which pushes images directly from the macOS host network, bypassing the Docker Desktop VM layer and its proxy entirely.

crane is the standard push method. Never use docker push to ECR.

Install once:

brew install crane

push.sh calls crane internally. If pushing manually outside the script:

# 1. Build the image locally (linux/amd64 target)
docker build --platform linux/amd64 -t musehub/musehub:latest .

# 2. Save to a tar archive on the host
docker save musehub/musehub:latest -o /tmp/musehub-latest.tar

# 3. Authenticate crane against ECR
aws ecr get-login-password --region us-east-1 \
  | crane auth login 992382692655.dkr.ecr.us-east-1.amazonaws.com \
      --username AWS --password-stdin

# 4. Push with crane (runs entirely on the macOS host — no VPNKit involved)
crane push /tmp/musehub-latest.tar \
  992382692655.dkr.ecr.us-east-1.amazonaws.com/musehub/musehub:latest

Standard deploy

# Deploy to staging (always first)
bash deploy/push.sh staging

# Deploy to prod after staging smoke test
bash deploy/push.sh prod

# Or both in sequence
bash deploy/push.sh staging prod

Rollback

# List recent ECR image tags
aws ecr describe-images \
  --repository-name musehub/musehub \
  --region us-east-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-10:].imageTags[0]' \
  --output table

# Redeploy a specific tag (skips build+push)
IMAGE_TAG=<previous-tag> bash deploy/push.sh staging
IMAGE_TAG=<previous-tag> bash deploy/push.sh prod
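
Since a redeploy with IMAGE_TAG skips the build and push, a mistyped tag would only surface once the instance tries to pull it. A guard like the one below checks ECR first. It is a sketch: `tag_exists`, `rollback`, and the `AWS` override variable are hypothetical names, not part of push.sh.

```shell
# Hypothetical helpers. AWS can be overridden for testing.
AWS="${AWS:-aws}"

# Succeeds only if the tag exists in the ECR repository.
tag_exists() {   # $1 = image tag
  $AWS ecr describe-images --region us-east-1 \
    --repository-name musehub/musehub \
    --image-ids "imageTag=$1" >/dev/null 2>&1
}

# Roll an environment back to a previous tag, refusing unknown tags.
rollback() {     # $1 = tag, $2 = environment (staging or prod)
  if ! tag_exists "$1"; then
    echo "tag $1 not found in ECR; nothing deployed" >&2
    return 1
  fi
  IMAGE_TAG="$1" bash deploy/push.sh "$2"
}
```

Run it from the repository root, for example `rollback <previous-tag> staging`, so that the relative path to deploy/push.sh resolves.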

Emergency migration rollback (on instance via SSM)

aws ssm send-command \
  --instance-ids i-0855d6efe7fa1a49d \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cd /opt/musehub && sudo docker run --rm --network musehub_musehub-internal --env-file .env <ecr-image>:<tag> alembic downgrade -1"]' \
  --region us-east-1

Backups

No automated backup is configured yet. Planned:

Daily pg_dump compressed to S3 (or a second EBS snapshot)
Volume snapshot via AWS before every production deploy
Object store (musehub_data) is content-addressed — safe to snapshot at any time

Until automated backups are set up, take a manual snapshot before every prod deploy:

# On prod instance (-T disables TTY allocation so the piped dump is not altered)
sudo docker compose exec -T postgres pg_dump -U musehub musehub | gzip > ~/musehub-backup-$(date +%Y%m%d).sql.gz
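
A pipeline that ends in gzip reports gzip's exit status, so a failed pg_dump can still leave behind a small but valid .gz file. The sketch below checks the result before a deploy proceeds. The function name `verify_backup` is hypothetical, and the header check assumes a plain-format dump, which starts with a "PostgreSQL database dump" comment.

```shell
# Succeeds only if the file is a readable gzip that starts like a pg_dump.
verify_backup() {   # $1 = path to .sql.gz
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  gzip -t "$1" 2>/dev/null || { echo "corrupt gzip: $1" >&2; return 1; }
  gzip -dc "$1" | head -n 5 | grep -q 'PostgreSQL database dump' \
    || { echo "no pg_dump header: $1" >&2; return 1; }
}
```

For example, `verify_backup ~/musehub-backup-$(date +%Y%m%d).sql.gz && bash deploy/push.sh prod` would stop the deploy when the dump is unusable.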

Costs (approximate, us-east-1, 2025 pricing)

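| Item | $/month |
|---|---|
| t3.small (prod) | ~$15 |
| t3.small (staging) | ~$15 (stop when not in use to reduce cost) |
| Elastic IPs (2) | ~$3.65 each (~$7.30 total). Since February 2024 AWS bills every public IPv4 address at $0.005/hour, associated or not |
| EBS gp3 20 GB (each) | ~$1.60 |
| Total (both running) | ~$41/mo |

To pause staging when not needed:

aws ec2 stop-instances --region us-east-1 --instance-ids <STAGING_INSTANCE_ID>
# Start again with:
aws ec2 start-instances --region us-east-1 --instance-ids <STAGING_INSTANCE_ID>

The Elastic IP stays associated while the instance is stopped, so DNS does not need to change, but the address is still billed at the public IPv4 rate (~$3.65/mo). Stopping the instance saves only the compute charge. EBS storage also continues to bill.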

Secrets inventory

All secrets live in /opt/musehub/.env on each instance. Never committed to source.

| Secret | How generated | Rotation |
|---|---|---|
| `DB_PASSWORD` | `openssl rand -hex 16` | Manual, on compromise |
| `WEBHOOK_SECRET_KEY` | Fernet key | Manual, on compromise |
| `RUNNER_TOKEN` | `openssl rand -hex 32` | Manual, on compromise |

Ed25519 identity keys live in ~/.muse/identity.toml on each client machine. No server-side secret is involved in MSign auth — the public key in the DB is the credential.
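
For rotation, the three server-side secrets can be regenerated with the sketch below. The function names are hypothetical. The source does not say how the Fernet key is produced, so `fernet_key` is an assumption: it emits the standard Fernet key format (32 random bytes, URL-safe base64), the same format the Python cryptography library generates.

```shell
# Hex secret from N random bytes (uses openssl rand -hex N when available).
rand_hex() {   # $1 = number of random bytes
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$1"
  else
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
  fi
}

# Fernet-format key: 32 random bytes, URL-safe base64 (44 chars with padding).
fernet_key() {
  head -c 32 /dev/urandom | base64 | tr '+/' '-_' | tr -d '\n'
}
```

`rand_hex 16` matches the DB_PASSWORD recipe and `rand_hex 32` matches RUNNER_TOKEN. After writing new values to /opt/musehub/.env, the containers need a restart to pick them up, and a new DB_PASSWORD must also be applied to the PostgreSQL role itself.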
