Open #63 architecture

filed by gabriel human · 16 days ago

All objects (commits, snapshots, blobs) must be written to the object store — S3 canonical, DB is index

Problem

The object store invariant is broken. Blobs are correctly written to S3, with a DB index row pointing to the bytes via storage_uri. Commits and snapshots are not: they exist only in the DB, with no corresponding S3 bytes. This means:

S3 loss leaves commit history and snapshot manifests intact in the DB, but DB loss destroys them entirely, so the DB rather than S3 is the de facto source of truth for them
muse pull after a server-side merge fails: the merge commit was never written to S3, so wire_fetch_mpack has to reconstruct it from DB fields instead of serving canonical bytes
The snapshot format stored in the DB (manifest_blob: msgpack) diverges from the canonical muse binary format (snapshot <size>\0<json>) that the local object store uses and that the wire protocol is meant to carry
The manifest_blob / delta_blob DB columns are msgpack-encoded, even though they are not wire-format contexts and should not use msgpack

Correct invariant

Type	S3 (source of truth)	DB (queryable index / cache)
Blob	raw bytes	`MusehubObject`: id, path, size, storage_uri
Snapshot	`snapshot <size>\0<json>`	`MusehubSnapshot`: id, entry_count, dirs, storage_uri, manifest_blob (cache)
Commit	`commit <size>\0<json>`	`MusehubCommit`: id, branch, parents, message, storage_uri

The muse binary format (commit, snapshot, blob type-prefixed) is the canonical format everywhere — local object store, S3, and wire. msgpack is wire-envelope only (the outer mpack framing), not a storage format.
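
As a rough illustration of the type-prefixed framing described above, the sketch below encodes and decodes a `<kind> <size>\0<json>` object. The helper names `encode_object` and `decode_object` are hypothetical, and the serialization details (sorted keys, compact separators, `<size>` counted as the byte length of the JSON body) are assumptions; the muse local store defines the real rules.

```python
import json

# Hypothetical helpers. Key order, separators, and the meaning of <size>
# (byte length of the JSON body) are assumptions, not confirmed muse behavior.
KINDS = {"commit", "snapshot"}


def encode_object(kind: str, payload: dict) -> bytes:
    """Frame a JSON payload as `<kind> <size>\\0<json>`."""
    if kind not in KINDS:
        raise ValueError(f"unsupported object kind: {kind!r}")
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"{kind} {len(body)}".encode("ascii") + b"\x00" + body


def decode_object(raw: bytes) -> tuple:
    """Split framed bytes back into (kind, payload), validating the header."""
    header, sep, body = raw.partition(b"\x00")
    if not sep:
        raise ValueError("missing NUL separator after header")
    kind, _, size = header.decode("ascii").partition(" ")
    if kind not in KINDS:
        raise ValueError(f"unexpected type prefix: {kind!r}")
    if not size.isdigit() or int(size) != len(body):
        raise ValueError("declared size does not match body length")
    return kind, json.loads(body)
```

Because the header is plain ASCII, a msgpack-encoded value handed to `decode_object` fails fast with a `ValueError` rather than being misread as a commit or snapshot.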

Blast radius

Commit write sites (DB-only today — each needs `backend.put(commit_id, commit_bytes)`)

musehub/services/musehub_proposals.py:753 — merge_proposal merge commit
musehub/services/musehub_repository.py:232 — repo init commit
musehub/services/musehub_sync.py:234 — sync path commit
musehub/services/musehub_sync.py:527 — commit_files_to_repo
musehub/services/musehub_wire_push.py:843 — push receive bulk insert

Snapshot write sites (DB-only with msgpack today — each needs `backend.put(snapshot_id, snapshot_bytes)`)

musehub/services/musehub_snapshot.py:187 — upsert_snapshot_entries (primary path)
musehub/services/musehub_wire_push.py:207 — push receive merge
musehub/services/musehub_wire_push.py:683 — push receive bulk insert

manifest_blob / delta_blob readers (heavy — keep as DB cache, read from S3 on miss)

musehub/services/musehub_snapshot.py — manifest decode helpers
musehub/services/musehub_wire_fetch.py — fetch delta computation (lines 240, 259, 437, 466, 875, 888)
musehub/services/musehub_wire_shared.py — _snap_row_to_wire and manifest chain walk
musehub/services/musehub_intel_providers.py — 10+ read sites for code intelligence
musehub/services/musehub_gc.py:151,168 — GC manifest walks
musehub/services/musehub_governance.py:105
musehub/services/musehub_auth.py:241
musehub/services/musehub_orgs.py:259
musehub/services/musehub_social.py:153
musehub/services/musehub_symbol_indexer.py:1121,1373
musehub/api/routes/api/identities.py:293
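
The "DB cache, read from S3 on miss" behavior for these readers could look roughly like the sketch below. `SnapshotRow`, `MemoryBackend`, `load_manifest`, and the `get` method are all hypothetical stand-ins (the issue only names `backend.put` and `backend.exists`), and the cache is assumed to hold JSON bytes, which Phase 6 leaves open.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnapshotRow:
    """Stand-in for a MusehubSnapshot row (only the fields used here)."""
    id: str
    manifest_blob: Optional[bytes] = None  # DB cache; JSON encoding is assumed


class MemoryBackend:
    """In-memory stand-in for the S3 backend; `get` is a hypothetical method."""

    def __init__(self) -> None:
        self.objects = {}
        self.reads = 0  # counts round-trips so the cache effect is visible

    def get(self, object_id: str) -> bytes:
        self.reads += 1
        return self.objects[object_id]


def load_manifest(row: SnapshotRow, backend: MemoryBackend) -> dict:
    """Return the manifest from the DB cache, falling back to S3 on a miss."""
    if row.manifest_blob is not None:
        return json.loads(row.manifest_blob)
    raw = backend.get(row.id)
    header, _, body = raw.partition(b"\x00")
    if not header.startswith(b"snapshot "):
        raise ValueError("not a snapshot object")
    row.manifest_blob = body  # repopulate the cache for later readers
    return json.loads(body)
```

Routing the read sites listed above through one helper of this shape would keep the hot paths (intel, GC) off S3 while still treating the S3 bytes as authoritative.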

msgpack packb write sites to eliminate from non-wire paths

musehub/services/musehub_snapshot.py:184,227 — upsert_snapshot_entries and bulk upsert
musehub/services/musehub_wire_push.py:205,628,632 — push receive manifest/delta encoding

_snap_row_to_wire call sites (3 — replaced by direct S3 bytes serve)

musehub/services/musehub_wire_fetch.py:283,407,880

DB schema additions needed

MusehubCommit: add storage_uri column
MusehubSnapshot: add storage_uri column

Implementation — phased TDD plan

Phase 1 — Failing tests + DB schema

Write failing tests asserting:

After merge_proposal, backend.exists(merge_commit_id) is True
After merge_proposal, backend.exists(merged_snapshot_id) is True
After push receive, backend.exists(commit_id) is True for every received commit
After push receive, backend.exists(snapshot_id) is True for every received snapshot
After repo init, backend.exists(init_commit_id) is True
After commit_files_to_repo, backend.exists(commit_id) is True
Fetched commit bytes from S3 decode as valid commit <size>\0<json> (not msgpack)
Fetched snapshot bytes from S3 decode as valid snapshot <size>\0<json> (not msgpack)

Add Alembic migration: nullable storage_uri on musehub_commits and musehub_snapshots. All tests must be RED before proceeding.
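
The Phase 1 assertions share one shape: the object exists in the store and its bytes are canonically framed. A minimal sketch of a reusable test helper follows, assuming an in-memory fake backend. `MemoryBackend`, its `get` method, the `mem://` URI scheme, and `assert_stored_canonically` are all hypothetical names for illustration.

```python
import json


class MemoryBackend:
    """In-memory stand-in for the S3 backend; `get` is a hypothetical method."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, object_id: str, data: bytes) -> str:
        self.objects[object_id] = data
        return f"mem://{object_id}"  # URI scheme is made up for the sketch

    def exists(self, object_id: str) -> bool:
        return object_id in self.objects

    def get(self, object_id: str) -> bytes:
        return self.objects[object_id]


def assert_stored_canonically(backend, object_id: str, kind: str) -> None:
    """Fail unless the object exists and is framed as `<kind> <size>\\0<json>`."""
    assert backend.exists(object_id), f"{kind} {object_id} missing from object store"
    raw = backend.get(object_id)
    prefix = kind.encode("ascii") + b" "
    # A msgpack map starts with 0x80-0x8f, 0xde, or 0xdf, so it fails this check.
    assert raw.startswith(prefix), f"expected {kind!r} type prefix"
    header, sep, body = raw.partition(b"\x00")
    assert sep, "missing NUL separator"
    assert int(header[len(prefix):].decode("ascii")) == len(body), "size mismatch"
    json.loads(body)  # raises ValueError if the body is not JSON
```

Each test would then call the real write path (for example merge_proposal) against the fake backend and finish with `assert_stored_canonically(backend, merge_commit_id, "commit")`, which stays RED until Phase 2 lands.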

Phase 2 — Commit write path goes green

Add backend.put(commit_id, commit_bytes) at all 5 commit write sites. Commit bytes format: commit <size>\0<json> (identical to muse local store). Populate MusehubCommit.storage_uri from the returned URI. Phase 1 commit tests go GREEN. Snapshot tests still RED.
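
A minimal sketch of what one commit write site might do after this phase, assuming a synchronous `backend.put` that returns the storage URI. `CommitRow`, `MemoryBackend`, `commit_bytes`, and `store_commit` are hypothetical, and the JSON payload fields shown are only the ones named in the invariant table; the real commit JSON likely carries more.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CommitRow:
    """Stand-in for a MusehubCommit row (fields taken from the invariant table)."""
    id: str
    branch: str
    parents: List[str]
    message: str
    storage_uri: Optional[str] = None


class MemoryBackend:
    """In-memory stand-in for the S3 backend used by the sketch."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, object_id: str, data: bytes) -> str:
        self.objects[object_id] = data
        return f"mem://{object_id}"  # URI scheme is made up for the sketch

    def exists(self, object_id: str) -> bool:
        return object_id in self.objects


def commit_bytes(row: CommitRow) -> bytes:
    """Build `commit <size>\\0<json>`; the payload field set is an assumption."""
    payload = {"branch": row.branch, "parents": row.parents, "message": row.message}
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return b"commit %d\x00" % len(body) + body


def store_commit(row: CommitRow, backend: MemoryBackend) -> CommitRow:
    """Write canonical bytes first, then record the returned URI on the row."""
    row.storage_uri = backend.put(row.id, commit_bytes(row))
    return row  # the caller would add this row to its DB session afterwards
```

Writing to the object store before the DB row is persisted is a design choice in this sketch: a failed put then leaves no index row pointing at missing bytes.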

Phase 3 — Snapshot write path goes green

Add backend.put(snapshot_id, snapshot_bytes) at all 3 snapshot write sites. Snapshot bytes format: snapshot <size>\0<json> (identical to muse local store). Keep manifest_blob column populated from the same decoded dict (DB cache — avoids S3 round-trips for intel/GC hot paths). Populate MusehubSnapshot.storage_uri from the returned URI. All Phase 1 tests go GREEN.
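
The point of "the same decoded dict" is that one manifest value feeds both the canonical S3 bytes and the DB cache, so the two cannot drift. A rough sketch under stated assumptions: the manifest is a flat path-to-blob-id mapping, the cache holds JSON bytes, and `SnapshotRow`, `MemoryBackend`, and `store_snapshot` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnapshotRow:
    """Stand-in for a MusehubSnapshot row (subset of the table's columns)."""
    id: str
    entry_count: int
    storage_uri: Optional[str]
    manifest_blob: bytes  # DB cache only; S3 holds the source of truth


class MemoryBackend:
    """In-memory stand-in for the S3 backend used by the sketch."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, object_id: str, data: bytes) -> str:
        self.objects[object_id] = data
        return f"mem://{object_id}"  # URI scheme is made up for the sketch


def store_snapshot(snapshot_id: str, manifest: dict, backend: MemoryBackend) -> SnapshotRow:
    """Derive S3 bytes, cache column, and entry_count from one manifest dict."""
    body = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    uri = backend.put(snapshot_id, b"snapshot %d\x00" % len(body) + body)
    return SnapshotRow(
        id=snapshot_id,
        entry_count=len(manifest),  # assumes one manifest key per entry
        storage_uri=uri,
        manifest_blob=body,
    )
```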

Phase 4 — Wire fetch serves canonical bytes

Replace _snap_row_to_wire with direct S3 read of snapshot <size>\0<json> bytes. Replace commit reconstruction in wire_fetch_mpack with direct S3 read of commit <size>\0<json> bytes. Fall back to DB reconstruction when storage_uri is null (pre-backfill rows). Add tests: fetched mpack contains bytes that decode cleanly as the canonical binary format.
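
The fallback rule in this phase can be sketched as a single branch on storage_uri. Everything named here (`CommitRow`, `MemoryBackend.get`, `reconstruct_commit_bytes`, `commit_bytes_for_wire`) is hypothetical, and the reconstruction payload is an assumed subset of the real commit fields.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CommitRow:
    """Stand-in for a MusehubCommit row."""
    id: str
    branch: str
    parents: List[str]
    message: str
    storage_uri: Optional[str] = None


class MemoryBackend:
    """In-memory stand-in for the S3 backend; `get` is a hypothetical method."""

    def __init__(self) -> None:
        self.objects = {}
        self.reads = 0

    def get(self, object_id: str) -> bytes:
        self.reads += 1
        return self.objects[object_id]


def reconstruct_commit_bytes(row: CommitRow) -> bytes:
    """Legacy path: rebuild framed bytes from DB fields (field set assumed)."""
    payload = {"branch": row.branch, "parents": row.parents, "message": row.message}
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return b"commit %d\x00" % len(body) + body


def commit_bytes_for_wire(row: CommitRow, backend: MemoryBackend) -> bytes:
    """Serve stored bytes verbatim; reconstruct only for pre-backfill rows."""
    if row.storage_uri is not None:
        return backend.get(row.id)
    return reconstruct_commit_bytes(row)
```

Once the Phase 5 backfill leaves no null storage_uri rows, the second branch becomes unreachable and can be deleted along with the reconstruction helper.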

Phase 5 — Backfill existing objects

Migration script: for every MusehubCommit and MusehubSnapshot where storage_uri IS NULL, reconstruct canonical bytes from DB fields and write to S3. Run on staging first. Verify muse pull works end-to-end for existing repos. Mark backfill complete when zero null storage_uri rows remain.
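
A backfill of this kind is easiest to reason about when it is idempotent, so it can be re-run after a partial failure. The loop below is a sketch only: `IndexRow`, `MemoryBackend`, `canonical_bytes`, and `backfill` are hypothetical, and a real script would page through DB rows and commit in batches rather than take an in-memory list.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class IndexRow:
    """Generic stand-in for a MusehubCommit or MusehubSnapshot row."""
    id: str
    kind: str      # "commit" or "snapshot"
    fields: dict   # DB columns used to rebuild the JSON body (assumed shape)
    storage_uri: Optional[str] = None


class MemoryBackend:
    """In-memory stand-in for the S3 backend used by the sketch."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, object_id: str, data: bytes) -> str:
        self.objects[object_id] = data
        return f"mem://{object_id}"  # URI scheme is made up for the sketch


def canonical_bytes(row: IndexRow) -> bytes:
    """Rebuild `<kind> <size>\\0<json>` from the row's DB fields."""
    body = json.dumps(row.fields, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"{row.kind} {len(body)}".encode("ascii") + b"\x00" + body


def backfill(rows: Iterable[IndexRow], backend: MemoryBackend) -> int:
    """Write bytes for every row lacking storage_uri; return how many were written."""
    written = 0
    for row in rows:
        if row.storage_uri is not None:
            continue  # already backed by S3, so skipping makes re-runs safe
        row.storage_uri = backend.put(row.id, canonical_bytes(row))
        written += 1
    return written
```

If object ids are content hashes of the canonical bytes (an assumption here, not stated in the issue), the staging run should also confirm that each reconstructed object hashes back to its existing id before the URI is recorded.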

Phase 6 — Eliminate msgpack from non-wire paths

Remove msgpack from upsert_snapshot_entries and the push receive snapshot write paths. manifest_blob stays as a DB cache: either switch it from msgpack bytes to JSON if that simplifies the column, or keep msgpack for compactness with a clear comment that it is a cache, not the source of truth. Remove the delta_blob-based manifest reconstruction from _snap_row_to_wire (dead code once Phase 4 lands). Clean up all remaining non-wire msgpack packb/unpackb call sites.

Definition of done

Every commit and snapshot written anywhere in musehub results in a corresponding S3 object in muse binary format
muse pull from staging after a server-side merge commit works end-to-end
muse verify on a freshly pulled repo passes
S3 is the disaster-recovery source of truth: a fresh DB seeded only from S3 can reconstruct all commit and snapshot metadata
No msgpack in storage paths — only in wire-framing

Assignee

gabriel human
