gabriel / musehub public
filed by gabriel human · 16 days ago

All objects (commits, snapshots, blobs) must be written to the object store — S3 canonical, DB is index

Problem

The object store invariant is broken. Blobs are correctly written to S3 with a DB index row pointing to storage_uri. Commits and snapshots are not — they are DB-only, with no corresponding S3 bytes. This means:

  • S3 loss leaves commit history and snapshot manifests intact in DB; DB loss destroys them entirely
  • muse pull after a server-side merge fails because the merge commit was never written to S3, so wire_fetch_mpack has to reconstruct it from DB fields instead of serving canonical bytes
  • The stored format for snapshots in the DB (manifest_blob: msgpack) diverges from the canonical muse binary format (snapshot <size>\0<json>) used by the local object store and intended for the wire protocol
  • msgpack is used in manifest_blob / delta_blob DB columns — these are not wire-format contexts and should not use msgpack

Correct invariant

Type     | S3 (source of truth)    | DB (queryable index / cache)
Blob     | raw bytes               | MusehubObject: id, path, size, storage_uri
Snapshot | snapshot <size>\0<json> | MusehubSnapshot: id, entry_count, dirs, storage_uri, manifest_blob (cache)
Commit   | commit <size>\0<json>   | MusehubCommit: id, branch, parents, message, storage_uri

The muse binary format (commit, snapshot, blob type-prefixed) is the canonical format everywhere — local object store, S3, and wire. msgpack is wire-envelope only (the outer mpack framing), not a storage format.
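A minimal sketch of the type-prefixed layout for commits and snapshots may help make the invariant concrete. It assumes that <size> is the decimal byte length of the UTF-8 JSON payload and that the JSON is serialized deterministically; neither detail is stated in this issue, and encode_object / decode_object are hypothetical names, not muse APIs.

```python
import json

# Assumption: header is "<type> <size>", then one NUL byte, then the JSON body.
KNOWN_TYPES = {"commit", "snapshot"}


def encode_object(obj_type: str, payload: dict) -> bytes:
    """Serialize payload into the type-prefixed layout described above."""
    if obj_type not in KNOWN_TYPES:
        raise ValueError(f"unknown object type: {obj_type!r}")
    # Sorted keys and compact separators keep the bytes stable (assumed, not confirmed).
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    header = f"{obj_type} {len(body)}".encode("ascii")
    return header + b"\x00" + body


def decode_object(raw: bytes) -> tuple[str, dict]:
    """Parse type-prefixed bytes, rejecting anything without a valid header."""
    header, sep, body = raw.partition(b"\x00")
    if not sep:
        raise ValueError("missing NUL separator")
    obj_type, _, size = header.decode("ascii").partition(" ")
    if obj_type not in KNOWN_TYPES or not size.isdigit():
        raise ValueError(f"malformed header: {header!r}")
    if int(size) != len(body):
        raise ValueError("size in header does not match payload length")
    return obj_type, json.loads(body)
```

Because the decoder insists on the NUL-terminated header, msgpack-encoded bytes are rejected rather than silently accepted, which is the property the Phase 1 format tests rely on.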

Blast radius

Commit write sites (DB-only today — each needs backend.put(commit_id, commit_bytes))

  • musehub/services/musehub_proposals.py:753 — merge_proposal merge commit
  • musehub/services/musehub_repository.py:232 — repo init commit
  • musehub/services/musehub_sync.py:234 — sync path commit
  • musehub/services/musehub_sync.py:527 — commit_files_to_repo
  • musehub/services/musehub_wire_push.py:843 — push receive bulk insert

Snapshot write sites (DB-only with msgpack today — each needs backend.put(snapshot_id, snapshot_bytes))

  • musehub/services/musehub_snapshot.py:187 — upsert_snapshot_entries (primary path)
  • musehub/services/musehub_wire_push.py:207 — push receive merge
  • musehub/services/musehub_wire_push.py:683 — push receive bulk insert

manifest_blob / delta_blob readers (heavy — keep as DB cache, read from S3 on miss)

  • musehub/services/musehub_snapshot.py — manifest decode helpers
  • musehub/services/musehub_wire_fetch.py — fetch delta computation (lines 240, 259, 437, 466, 875, 888)
  • musehub/services/musehub_wire_shared.py — _snap_row_to_wire and manifest chain walk
  • musehub/services/musehub_intel_providers.py — 10+ read sites for code intelligence
  • musehub/services/musehub_gc.py:151,168 — GC manifest walks
  • musehub/services/musehub_governance.py:105
  • musehub/services/musehub_auth.py:241
  • musehub/services/musehub_orgs.py:259
  • musehub/services/musehub_social.py:153
  • musehub/services/musehub_symbol_indexer.py:1121,1373
  • musehub/api/routes/api/identities.py:293
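The "DB cache, read from S3 on miss" behaviour for these readers could look roughly like the sketch below. SnapshotRow, load_manifest, and the fetch callable are hypothetical stand-ins; the sketch also assumes the cache holds the JSON payload bytes and that the manifest is the whole JSON body, neither of which this issue settles.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SnapshotRow:
    """Hypothetical stand-in for a MusehubSnapshot row."""
    id: str
    storage_uri: Optional[str]
    manifest_blob: Optional[bytes] = None  # cache only, never authoritative


def load_manifest(row: SnapshotRow, fetch: Callable[[str], bytes]) -> dict:
    """Return the manifest, preferring the DB cache and refilling it on a miss."""
    if row.manifest_blob is not None:
        return json.loads(row.manifest_blob)
    if row.storage_uri is None:
        raise LookupError(f"snapshot {row.id} has neither cache nor storage_uri")
    raw = fetch(row.storage_uri)  # expected layout: snapshot <size>\0<json>
    header, _, body = raw.partition(b"\x00")
    if not header.startswith(b"snapshot "):
        raise ValueError("object at storage_uri is not a snapshot")
    row.manifest_blob = body  # refill the cache; the caller persists the row
    return json.loads(body)
```

Hot paths such as intel and GC would then hit S3 only once per snapshot, while the bytes in S3 remain the copy that everything else is derived from.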

msgpack packb write sites to eliminate from non-wire paths

  • musehub/services/musehub_snapshot.py:184,227 — upsert_snapshot_entries and bulk upsert
  • musehub/services/musehub_wire_push.py:205,628,632 — push receive manifest/delta encoding

_snap_row_to_wire call sites (3 — replaced by direct S3 bytes serve)

  • musehub/services/musehub_wire_fetch.py:283,407,880

DB schema additions needed

  • MusehubCommit: add storage_uri column
  • MusehubSnapshot: add storage_uri column

Implementation — phased TDD plan

Phase 1 — Failing tests + DB schema

Write failing tests asserting:

  • After merge_proposal, backend.exists(merge_commit_id) is True
  • After merge_proposal, backend.exists(merged_snapshot_id) is True
  • After push receive, backend.exists(commit_id) is True for every received commit
  • After push receive, backend.exists(snapshot_id) is True for every received snapshot
  • After repo init, backend.exists(init_commit_id) is True
  • After commit_files_to_repo, backend.exists(commit_id) is True
  • Fetched commit bytes from S3 decode as valid commit <size>\0<json> (not msgpack)
  • Fetched snapshot bytes from S3 decode as valid snapshot <size>\0<json> (not msgpack)

Add Alembic migration: nullable storage_uri on musehub_commits and musehub_snapshots. All tests must be RED before proceeding.
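One way to keep the Phase 1 assertions uniform is a shared helper plus an in-memory backend double, sketched here. InMemoryBackend, its get method, the mem:// URI scheme, and assert_canonical are all assumptions made for illustration; only put and exists are named in this issue, and the real backend may well be async.

```python
import json


class InMemoryBackend:
    """Hypothetical test double exposing the put/exists calls named above."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, object_id: str, data: bytes) -> str:
        self._objects[object_id] = data
        return f"mem://{object_id}"  # made-up URI scheme for tests

    def exists(self, object_id: str) -> bool:
        return object_id in self._objects

    def get(self, object_id: str) -> bytes:  # assumed read method
        return self._objects[object_id]


def assert_canonical(backend: InMemoryBackend, object_id: str, expected_type: str) -> None:
    """Fail unless the stored bytes follow the <type> <size>\\0<json> layout."""
    assert backend.exists(object_id), f"{object_id} missing from object store"
    header, sep, body = backend.get(object_id).partition(b"\x00")
    assert sep == b"\x00", "no NUL separator (msgpack or another format?)"
    obj_type, _, size = header.decode("ascii").partition(" ")
    assert obj_type == expected_type, f"expected {expected_type}, got {obj_type}"
    assert int(size) == len(body), "size in header does not match payload"
    json.loads(body)  # raises if the payload is not valid JSON
```

A test for merge_proposal would then run the merge against this backend and call assert_canonical(backend, merge_commit_id, "commit"); against today's DB-only write path that call fails on the first assertion, which is the RED state this phase asks for.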

Phase 2 — Commit write path goes green

Add backend.put(commit_id, commit_bytes) at all 5 commit write sites. Commit bytes format: commit <size>\0<json> (identical to muse local store). Populate MusehubCommit.storage_uri from the returned URI. Phase 1 commit tests go GREEN. Snapshot tests still RED.

Phase 3 — Snapshot write path goes green

Add backend.put(snapshot_id, snapshot_bytes) at all 3 snapshot write sites. Snapshot bytes format: snapshot <size>\0<json> (identical to muse local store). Keep manifest_blob column populated from the same decoded dict (DB cache — avoids S3 round-trips for intel/GC hot paths). Populate MusehubSnapshot.storage_uri from the returned URI. All Phase 1 tests go GREEN.
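The snapshot write described above might be shaped like this sketch, which derives both the S3 bytes and the DB cache from the same dict. The Backend protocol, SnapshotRow, and write_snapshot are hypothetical; the sketch assumes put is synchronous and returns the storage URI, and that the cache stores the JSON payload bytes.

```python
import json
from dataclasses import dataclass
from typing import Optional, Protocol


class Backend(Protocol):
    """Assumed shape of the object-store backend used in this issue."""

    def put(self, object_id: str, data: bytes) -> str: ...


@dataclass
class SnapshotRow:
    """Hypothetical stand-in for a MusehubSnapshot row."""
    id: str
    storage_uri: Optional[str] = None
    manifest_blob: Optional[bytes] = None


def write_snapshot(backend: Backend, row: SnapshotRow, manifest: dict) -> None:
    """Write canonical bytes to the object store, then fill the DB index fields."""
    body = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    canonical = b"snapshot %d\x00" % len(body) + body
    # Object store first: if put raises, the row never claims a URI it lacks.
    row.storage_uri = backend.put(row.id, canonical)
    # Cache derived from the same dict, so it cannot drift from the S3 payload.
    row.manifest_blob = body
```

The Phase 2 commit sites would follow the same ordering with a commit prefix and no manifest cache.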

Phase 4 — Wire fetch serves canonical bytes

Replace _snap_row_to_wire with direct S3 read of snapshot <size>\0<json> bytes. Replace commit reconstruction in wire_fetch_mpack with direct S3 read of commit <size>\0<json> bytes. Fall back to DB reconstruction when storage_uri is null (pre-backfill rows). Add tests: fetched mpack contains bytes that decode cleanly as the canonical binary format.
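The serve-from-S3-with-fallback rule can be sketched as a single function. object_bytes_for_wire, fetch, and rebuild_from_db are hypothetical names; in the real code the fallback would presumably be the existing DB reconstruction logic.

```python
from typing import Callable, Optional


def object_bytes_for_wire(
    obj_type: str,
    storage_uri: Optional[str],
    fetch: Callable[[str], bytes],
    rebuild_from_db: Callable[[], bytes],
) -> bytes:
    """Return the bytes to embed in the mpack for one commit or snapshot."""
    if storage_uri is None:
        # Pre-backfill row: nothing in S3 yet, so reconstruct from DB fields.
        return rebuild_from_db()
    raw = fetch(storage_uri)
    # Cheap sanity check so a mislabeled object is never served as canonical.
    if not raw.startswith(obj_type.encode("ascii") + b" "):
        raise ValueError(f"object at {storage_uri} is not a {obj_type}")
    return raw
```

Once the Phase 5 backfill leaves no null storage_uri rows, the first branch is never taken and can be deleted along with the reconstruction code.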

Phase 5 — Backfill existing objects

Migration script: for every MusehubCommit and MusehubSnapshot where storage_uri IS NULL, reconstruct canonical bytes from DB fields and write to S3. Run on staging first. Verify muse pull works end-to-end for existing repos. Mark backfill complete when zero null storage_uri rows remain.
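A rough shape for the backfill loop, written so it is safe to re-run, is sketched below. Row, backfill, count_pending, and the two callables are hypothetical; real code would page through the DB and commit in batches rather than hold every row in memory.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Row:
    """Hypothetical stand-in for a MusehubCommit or MusehubSnapshot row."""
    id: str
    storage_uri: Optional[str] = None


def backfill(
    rows: Iterable[Row],
    put: Callable[[str, bytes], str],
    rebuild: Callable[[Row], bytes],
) -> int:
    """Write canonical bytes for every row lacking storage_uri; return the count."""
    written = 0
    for row in rows:
        if row.storage_uri is not None:
            continue  # already backfilled, so re-running is harmless
        row.storage_uri = put(row.id, rebuild(row))
        written += 1
    return written


def count_pending(rows: Iterable[Row]) -> int:
    """Number of rows still missing storage_uri; zero means backfill is complete."""
    return sum(1 for row in rows if row.storage_uri is None)
```

If object IDs are derived from the canonical bytes (this issue does not say), the script would also want to check that each reconstructed object hashes back to its existing ID before writing it.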

Phase 6 — Eliminate msgpack from non-wire paths

Remove msgpack from upsert_snapshot_entries and the push receive snapshot write paths. manifest_blob stays as a DB cache: either switch it from msgpack bytes to JSON if that simplifies the column, or keep msgpack for compactness with a clear comment that it is a cache, not the source of truth. Remove the delta_blob-based manifest reconstruction in _snap_row_to_wire, which is dead code once Phase 4 and the Phase 5 backfill have landed. Clean up all remaining non-wire msgpack packb/unpackb call sites (apart from the manifest_blob cache, if it is kept as msgpack).

Definition of done

  • Every commit and snapshot written anywhere in musehub results in a corresponding S3 object in muse binary format
  • muse pull from staging after a server-side merge commit works end-to-end
  • muse verify on a freshly pulled repo passes
  • S3 is the disaster-recovery source of truth: a fresh DB seeded only from S3 can reconstruct all commit and snapshot metadata
  • No msgpack in storage paths — only in wire-framing
Activity
gabriel opened this issue 16 days ago