gabriel / muse public
Closed #11 Bug
filed by gabriel human · 19 days ago

Replace 2-byte header with Git-idiomatic type prefix in unified object store


Problem

The unified object store (#58) was implemented with a 2-byte binary header [type_byte, version_byte] prepended to object payloads. This is not how Git works.

Git prepends the type and size to the content before hashing and storing. The same scheme, with sha256 as the hash:

store:   "<type> <size>\0<payload>"
hash:    sha256("<type> <size>\0<payload>")
on disk: objects/<2hex>/<62hex>  ← same path layout, already correct

The type is recoverable by reading the object, so no separate header is needed. Because the type is part of the hashed bytes, a blob and a commit with identical payload bytes get different object IDs.
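
A minimal sketch of that property, using only hashlib. The helper name git_style_id is hypothetical and not part of the codebase; it only illustrates the "<type> <size>\0<payload>" framing described above.

```python
import hashlib


def git_style_id(type_str: str, payload: bytes) -> str:
    """Return the SHA-256 hex digest of the type-prefixed payload."""
    # Header is "<type> <size>" followed by a single NUL byte.
    header = f"{type_str} {len(payload)}\0".encode("ascii")
    return hashlib.sha256(header + payload).hexdigest()


payload = b"identical bytes"
blob_id = git_style_id("blob", payload)
commit_id = git_style_id("commit", payload)

# Same payload, different type prefix, therefore different IDs.
print(blob_id != commit_id)
```

The size field also means the header can be parsed without scanning the payload: everything up to the first NUL byte is header, everything after it is payload.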

What needs to change

Phase 1 — Replace the format primitives

  • Remove OBJECT_TYPE_COMMIT, OBJECT_TYPE_SNAPSHOT, OBJECT_TYPE_BLOB, OBJECT_FORMAT_V1 constants
  • Remove write_typed_object / read_typed_object from object_store.py
  • Add write_git_object(repo, type_str, payload) -> str — prepends "<type> <size>\0", hashes the full string, writes to objects/<algo>/<2>/<62>, returns the object ID
  • Add read_git_object(repo, object_id) -> tuple[str, bytes] | None — reads the file, parses up to \0 to extract type and size, returns (type_str, payload)
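
One possible shape for the two primitives, as a sketch rather than the actual implementation. It assumes repo is a pathlib.Path to the repository root, that objects live under .muse/objects/sha256/ as in the acceptance criteria, and that files are stored uncompressed; the temp-file-plus-rename write is also an assumption, chosen so readers never see a partial object.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional

ALGO = "sha256"  # assumed directory name for the hash algorithm


def _object_path(repo: Path, object_id: str) -> Path:
    # Hypothetical helper: first 2 hex chars are the directory, the other 62 the file name.
    return repo / ".muse" / "objects" / ALGO / object_id[:2] / object_id[2:]


def write_git_object(repo: Path, type_str: str, payload: bytes) -> str:
    data = f"{type_str} {len(payload)}\0".encode("ascii") + payload
    object_id = hashlib.sha256(data).hexdigest()
    path = _object_path(repo, object_id)
    if not path.exists():  # content-addressed: an existing file already holds these bytes
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp_name = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, path)  # atomic rename into place
    return object_id


def read_git_object(repo: Path, object_id: str) -> Optional[tuple[str, bytes]]:
    try:
        data = _object_path(repo, object_id).read_bytes()
    except FileNotFoundError:
        return None
    header, sep, payload = data.partition(b"\0")
    if not sep:
        raise ValueError(f"object {object_id} has no NUL-terminated header")
    type_str, _, size = header.decode("ascii").partition(" ")
    if int(size) != len(payload):
        raise ValueError(f"object {object_id} declares size {size}, found {len(payload)}")
    return type_str, payload
```

Usage would look like oid = write_git_object(repo, "blob", b"hello") followed by read_git_object(repo, oid), which returns ("blob", b"hello"). Optional[...] is used here instead of the | None spelling only so the sketch also runs on Python 3.9.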

Phase 2 — Update ids.py

  • hash_blob(data) → sha256("blob <size>\0" + data)
  • hash_snapshot(manifest) → sha256("snapshot <size>\0" + canonical_bytes)
  • hash_commit(...) → sha256("commit <size>\0" + canonical_bytes)

All existing IDs change. Migration of old objects is a separate pass (already planned).

Phase 3 — Wire write_commit / read_commit

  • write_commit dual-writes: msgpack to .muse/commits/ (old), git-object to .muse/objects/ (new)
  • read_commit falls back to .muse/objects/ via read_git_object when the old path is absent

Phase 4 — Wire write_snapshot / read_snapshot

Same pattern as Phase 3.

Phase 5 — Blobs

  • write_object uses write_git_object with type_str="blob"
  • read_object uses read_git_object and strips the header

Phase 6 — Tests

Rewrite tests/test_unified_object_store.py to use the new primitives throughout. Remove all references to the 2-byte header constants.

Phase 7 — Migration

Backfill all existing .muse/commits/ and .muse/snapshots/ msgpack objects into .muse/objects/ in the git-idiomatic format. Delete old directories once migration is verified.

Acceptance criteria

  • All objects (blobs, snapshots, commits) live in .muse/objects/sha256/<2>/<62>
  • On-disk format is "<type> <size>\0<payload>" — no binary header
  • Object ID is sha256("<type> <size>\0<payload>") for all types
  • Reading any object by ID yields its type without any separate metadata
  • All 8 tests in test_unified_object_store.py pass against the new format
  • No references to OBJECT_TYPE_* or OBJECT_FORMAT_V1 remain in the codebase
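
The first four criteria can be checked mechanically. The following is a sketch of such a check, not an existing tool: verify_object_store is a hypothetical name, and it assumes objects are stored uncompressed under .muse/objects/sha256/<2>/<62>.

```python
import hashlib
from pathlib import Path

KNOWN_TYPES = {"blob", "snapshot", "commit"}


def verify_object_store(repo: Path) -> list[str]:
    """Return a list of problems found; an empty list means every object checks out."""
    problems = []
    root = repo / ".muse" / "objects" / "sha256"
    for path in sorted(root.glob("*/*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        object_id = path.parent.name + path.name
        # The ID must be the hash of the full on-disk bytes, header included.
        if hashlib.sha256(data).hexdigest() != object_id:
            problems.append(f"{object_id}: ID does not match sha256 of file contents")
            continue
        header, sep, payload = data.partition(b"\0")
        type_bytes, _, size_bytes = header.partition(b" ")
        if not sep or type_bytes.decode("ascii", "replace") not in KNOWN_TYPES:
            problems.append(f"{object_id}: missing or unknown type prefix")
        elif not size_bytes.isdigit() or int(size_bytes) != len(payload):
            problems.append(f"{object_id}: declared size does not match payload length")
    return problems
```

An object still carrying the old 2-byte binary header fails the ID check, since its file name would not be the hash of a type-prefixed byte string.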
Activity
gabriel opened this issue 19 days ago
gabriel 19 days ago

Remaining cleanup — load-bearing order

Migration ran successfully (9,706 blobs + 1,084 snapshots + 1,111 commits moved to unified object store). All seven phases are implemented. What remains is four shim-removal passes:


Pass 1 — delete compute_commit_id / compute_snapshot_id shims

File: muse/core/snapshot.py lines 339–511

Both are already one-liner wrappers that delegate to hash_commit / hash_snapshot in ids.py. The compute_commit_id docstring even says "Deprecated: use muse.core.ids.hash_commit instead." There are ~15 call sites across 13 files (commit.py, merge.py, rebase.py, cherry_pick.py, pull.py, shelf.py, revert.py, bridge.py, commit_tree.py, snapshot_cmd.py, merge_tree.py, mpack.py, core/rebase.py). Each just needs its import changed from snapshot to ids. Zero behavior change, zero risk.


Pass 2 — flip read priority in read_commit / read_snapshot

File: muse/core/store.py

Currently both functions check the legacy msgpack path first and fall back to the object store only when the file is absent. That priority should be reversed: object store first, msgpack as legacy fallback. This makes the object store the canonical read source. No data loss risk — both paths hold the same records.
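
The flipped priority amounts to trying the object store reader first and consulting the legacy reader only on a miss. A sketch with hypothetical names (read_record and the two reader callables are stand-ins, not the functions in store.py), using plain dicts in place of the two storage layers:

```python
from typing import Callable, Optional

# A reader takes an object ID and returns the decoded record, or None if absent.
Reader = Callable[[str], Optional[dict]]


def read_record(object_id: str, object_store: Reader, legacy: Reader) -> Optional[dict]:
    """Object store is canonical; the legacy msgpack path is only a fallback."""
    record = object_store(object_id)
    if record is not None:
        return record
    return legacy(object_id)


# Stand-in stores for illustration only.
new_store = {"abc": {"message": "from object store"}}
old_store = {"abc": {"message": "from msgpack"}, "old": {"message": "legacy only"}}

both = read_record("abc", new_store.get, old_store.get)
legacy_only = read_record("old", new_store.get, old_store.get)
missing = read_record("nope", new_store.get, old_store.get)
```

Here both comes from the object store even though the legacy store holds the same ID, legacy_only is served by the fallback, and missing is None.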


Pass 3 — stop dual-writing msgpack in write_commit / write_snapshot

File: muse/core/store.py lines 1632–1636 and 2210–2214

After pass 2 flips read priority, new commits and snapshots no longer need to land in .muse/commits/ and .muse/snapshots/. Remove the _write_msgpack_atomic call from each writer; keep only the object store write. Pass 2 and pass 3 can land in the same commit.


Pass 4 — delete legacy msgpack dirs and remove fallback reads

Files: muse/core/store.py + all repos on disk

Once we have run on object-store-only writes for a while and are confident, remove the fallback read branches in read_commit (lines 1736–1746) and read_snapshot (lines 2233–2244), then delete .muse/commits/ and .muse/snapshots/ from every repo. At that point the unified object store is the only storage layer — the ticket is fully closed.

gabriel 19 days ago

All four passes complete and committed as sha256:5b3273e7.

  • Pass 1: deleted compute_commit_id, compute_snapshot_id, snapshot_identity_bytes, commit_identity_bytes from snapshot.py; all 15 call sites updated to import hash_commit / hash_snapshot directly from ids.py
  • Pass 2: flipped read priority in read_commit / read_snapshot — object store first, msgpack fallback second
  • Pass 3: write_commit and write_snapshot write exclusively to the object store; commit_exists checks the object store path
  • Pass 4: removed msgpack fallback read branches; deleted 3,308 legacy msgpack files from .muse/commits/ and .muse/snapshots/

The unified object store is now the sole storage layer for all commits, snapshots, and blobs in the muse repo.