gabriel / muse public
Closed #10 Bug
filed by gabriel human · 29 days ago

Replace 2-byte header with Git-idiomatic type prefix in unified object store

Problem

The unified object store (#58) was implemented with a 2-byte binary header [type_byte, version_byte] prepended to object payloads. This is not how Git works.

Git bakes the type into the content before hashing and storing (Git itself defaults to SHA-1; sha256 is what this store uses):

store:   "<type> <size>\0<payload>"
hash:    sha256("<type> <size>\0<payload>")
on disk: objects/<2hex>/<62hex>  ← same path layout, already correct

The type is recoverable by reading the object, so no separate header is needed. Because the type prefix is part of the hashed bytes, a blob and a commit with identical payload bytes get different object IDs and never collide.
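
A minimal sketch of the scheme above, assuming sha256 and an ASCII decimal size. git_style_id is a hypothetical name used only for illustration, not a function in the codebase. It hashes the same payload once as a blob and once as a commit:

```python
import hashlib


def git_style_id(type_str: str, payload: bytes) -> str:
    """Hash "<type> <size>\\0<payload>" with SHA-256 and return the hex digest."""
    header = f"{type_str} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha256(header + payload).hexdigest()


payload = b"hello"
blob_id = git_style_id("blob", payload)
commit_id = git_style_id("commit", payload)

# Same payload bytes, different type prefix, so the two IDs differ.
print(blob_id != commit_id)
```

The first two hex characters of the ID pick the directory and the remaining 62 name the file, matching the path layout shown above.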

What needs to change

Phase 1 — Replace the format primitives

  • Remove OBJECT_TYPE_COMMIT, OBJECT_TYPE_SNAPSHOT, OBJECT_TYPE_BLOB, OBJECT_FORMAT_V1 constants
  • Remove write_typed_object / read_typed_object from object_store.py
  • Add write_git_object(repo, type_str, payload) -> str — prepends "<type> <size>\0", hashes the full string, writes to objects/<algo>/<2>/<62>, returns the object ID
  • Add read_git_object(repo, object_id) -> tuple[str, bytes] | None — reads the file, parses up to \0 to extract type and size, returns (type_str, payload)
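
One possible shape for the two new primitives, offered as a sketch rather than the final implementation. It assumes repo is a pathlib.Path to the working-tree root, that objects live under .muse/objects/sha256/, and that a malformed file should read as None; _object_path is a hypothetical helper. The return type is the tuple[str, bytes] | None from the bullet above, spelled with typing.Optional.

```python
import hashlib
from pathlib import Path
from typing import Optional, Tuple

ALGO = "sha256"  # assumption: the only algorithm directory in use


def _object_path(repo: Path, object_id: str) -> Path:
    # Hypothetical helper: objects/<algo>/<2 hex>/<62 hex>
    return repo / ".muse" / "objects" / ALGO / object_id[:2] / object_id[2:]


def write_git_object(repo: Path, type_str: str, payload: bytes) -> str:
    """Prepend "<type> <size>\\0", hash the full byte string, store it, return the ID."""
    data = f"{type_str} {len(payload)}".encode("ascii") + b"\x00" + payload
    object_id = hashlib.sha256(data).hexdigest()
    path = _object_path(repo, object_id)
    if not path.exists():  # content-addressed: an existing file already holds these bytes
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return object_id


def read_git_object(repo: Path, object_id: str) -> Optional[Tuple[str, bytes]]:
    """Return (type_str, payload), or None if the object is missing or malformed."""
    try:
        data = _object_path(repo, object_id).read_bytes()
    except FileNotFoundError:
        return None
    header, sep, payload = data.partition(b"\x00")  # split at the first NUL only
    if not sep:
        return None
    try:
        type_str, size_str = header.decode("ascii").split(" ", 1)
    except (UnicodeDecodeError, ValueError):
        return None
    if not size_str.isdigit() or int(size_str) != len(payload):
        return None
    return type_str, payload
```

Splitting at the first NUL keeps payloads that themselves contain \0 bytes intact, and the size check catches truncated files.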

Phase 2 — Update ids.py

  • hash_blob(data) → sha256("blob <size>\0" + data)
  • hash_snapshot(manifest) → sha256("snapshot <size>\0" + canonical_bytes)
  • hash_commit(...) → sha256("commit <size>\0" + canonical_bytes)

All existing IDs change. Migration of old objects is a separate pass (already planned).
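
A rough sketch of how the ids.py functions could share one helper. _typed_digest is a hypothetical name, and the sorted-key JSON used for canonical_bytes is only a stand-in assumption for whatever canonical serialization the codebase actually uses. hash_commit is left out because its arguments are not spelled out here; it would presumably follow the same pattern with the "commit" prefix.

```python
import hashlib
import json


def _typed_digest(type_str: str, body: bytes) -> str:
    """sha256 over "<type> <size>\\0" + body, returned as a hex string."""
    prefix = f"{type_str} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha256(prefix + body).hexdigest()


def hash_blob(data: bytes) -> str:
    return _typed_digest("blob", data)


def hash_snapshot(manifest: dict) -> str:
    # Stand-in canonicalization (assumption): compact JSON with sorted keys,
    # so key order in the input dict does not affect the ID.
    canonical_bytes = json.dumps(
        manifest, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return _typed_digest("snapshot", canonical_bytes)
```

Because the prefix is hashed along with the content, an ID computed this way differs from a plain sha256 of the same bytes, which is why every existing ID changes.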

Phase 3 — Wire write_commit / read_commit

  • write_commit dual-writes: msgpack to .muse/commits/ (old), git-object to .muse/objects/ (new)
  • read_commit falls back to .muse/objects/ via read_git_object when the old path is absent

Phase 4 — Wire write_snapshot / read_snapshot

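Same pattern as Phase 3: write_snapshot dual-writes to .muse/snapshots/ (old) and .muse/objects/ (new), and read_snapshot falls back to read_git_object when the old path is absent.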

Phase 5 — Blobs

  • write_object uses write_git_object with type_str="blob"
  • read_object uses read_git_object and returns only the payload, with the header stripped

Phase 6 — Tests

Rewrite tests/test_unified_object_store.py to use the new primitives throughout. Remove all references to the 2-byte header constants.

Phase 7 — Migration

Backfill all existing .muse/commits/ and .muse/snapshots/ msgpack objects into .muse/objects/ in the git-idiomatic format. Delete the old directories once the migration is verified.

Acceptance criteria

  • All objects (blobs, snapshots, commits) live in .muse/objects/sha256/<2>/<62>
  • On-disk format is "<type> <size>\0<payload>" — no binary header
  • Object ID is sha256("<type> <size>\0<payload>") for all types
  • Reading any object by ID yields its type without any separate metadata
  • All 8 tests in test_unified_object_store.py pass against the new format
  • No references to OBJECT_TYPE_* or OBJECT_FORMAT_V1 remain in the codebase
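
The first four criteria could be spot-checked with something like the sketch below. verify_store and KNOWN_TYPES are hypothetical names, and the function assumes repo is a pathlib.Path to the working-tree root; it is an illustration, not part of the planned test suite.

```python
import hashlib
from pathlib import Path
from typing import List

KNOWN_TYPES = {"blob", "snapshot", "commit"}


def verify_store(repo: Path) -> List[str]:
    """Return a list of problems found; an empty list means the store looks consistent."""
    problems: List[str] = []
    root = repo / ".muse" / "objects" / "sha256"
    if not root.is_dir():
        return problems
    for path in sorted(root.glob("*/*")):
        if not path.is_file():
            continue
        object_id = path.parent.name + path.name  # <2 hex> + <62 hex>
        data = path.read_bytes()
        header, sep, payload = data.partition(b"\x00")
        try:
            type_str, size_str = header.decode("ascii").split(" ", 1)
        except (UnicodeDecodeError, ValueError):
            type_str, size_str = "", ""
        if not sep or type_str not in KNOWN_TYPES or not size_str.isdigit():
            problems.append(f"{object_id}: malformed header")
        elif int(size_str) != len(payload):
            problems.append(f"{object_id}: size mismatch")
        elif hashlib.sha256(data).hexdigest() != object_id:
            problems.append(f"{object_id}: ID does not match content")
    return problems
```

A file still carrying the old 2-byte binary header has no "<type> <size>" prefix and would be reported as a malformed header.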