Closed #11 Bug

filed by gabriel human · 64 days ago

Replace 2-byte header with Git-idiomatic type prefix in unified object store

Problem

The unified object store (#58) was implemented with a 2-byte binary header [type_byte, version_byte] prepended to object payloads. This is not how Git works.

Git bakes the type into the content before hashing and storing (Git defaults to SHA-1; this store uses SHA-256):

store:   "<type> <size>\0<payload>"
hash:    sha256("<type> <size>\0<payload>")
on disk: objects/<2hex>/<62hex>  ← same path layout, already correct

The type is recoverable by reading the object, so no separate header is needed. Because the type is part of the hashed bytes, a blob and a commit with identical payload bytes get different object IDs.
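A minimal sketch of that framing. The helper names `frame` and `object_id` are hypothetical and not taken from the codebase. Run with SHA-1, the same framing reproduces Git's well-known empty-blob ID, which makes a handy sanity check; run with SHA-256, it shows a blob and a commit with the same payload getting different IDs.

```python
import hashlib


def frame(type_str: str, payload: bytes) -> bytes:
    """Return b"<type> <size>\\0<payload>", the bytes that get hashed and stored."""
    return f"{type_str} {len(payload)}".encode("ascii") + b"\x00" + payload


def object_id(type_str: str, payload: bytes, algo: str = "sha256") -> str:
    """Hash the framed bytes; the type is part of the input, so it affects the ID."""
    return hashlib.new(algo, frame(type_str, payload)).hexdigest()


payload = b"same bytes"
blob_id = object_id("blob", payload)
commit_id = object_id("commit", payload)
print(blob_id != commit_id)  # True

# Git's empty blob under SHA-1 is sha1("blob 0\0").
print(object_id("blob", b"", algo="sha1"))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```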

What needs to change

Phase 1 — Replace the format primitives

Remove OBJECT_TYPE_COMMIT, OBJECT_TYPE_SNAPSHOT, OBJECT_TYPE_BLOB, OBJECT_FORMAT_V1 constants
Remove write_typed_object / read_typed_object from object_store.py
Add write_git_object(repo, type_str, payload) -> str — prepends "<type> <size>\0", hashes the full string, writes to objects/<algo>/<2>/<62>, returns the object ID
Add read_git_object(repo, object_id) -> tuple[str, bytes] | None — reads the file, parses up to \0 to extract type and size, returns (type_str, payload)

Phase 2 — Update `ids.py`

hash_blob(data) → sha256("blob <size>\0" + data)
hash_snapshot(manifest) → sha256("snapshot <size>\0" + canonical_bytes)
hash_commit(...) → sha256("commit <size>\0" + canonical_bytes)

All existing IDs change. Migration of old objects is a separate pass (already planned).
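A sketch of the ids.py helpers under one loud assumption: the canonical bytes for a snapshot are taken to be sorted-key compact JSON, which is a stand-in for whatever canonical encoding the codebase actually uses. `_hash_framed` is a hypothetical shared helper. The last line compares against a bare SHA-256 of the payload to illustrate why adding the prefix changes every ID.

```python
import hashlib
import json


def _hash_framed(type_str: str, payload: bytes) -> str:
    header = f"{type_str} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha256(header + payload).hexdigest()


def hash_blob(data: bytes) -> str:
    return _hash_framed("blob", data)


def hash_snapshot(manifest: dict[str, str]) -> str:
    # ASSUMPTION: canonical bytes are sorted-key compact JSON; the real encoding may differ.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return _hash_framed("snapshot", canonical)


a = hash_snapshot({"b.txt": "id2", "a.txt": "id1"})
b = hash_snapshot({"a.txt": "id1", "b.txt": "id2"})
print(a == b)  # True: key order does not affect the ID

# An unprefixed hash of the same payload gives a different ID.
print(hash_blob(b"x") == hashlib.sha256(b"x").hexdigest())  # False
```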

Phase 3 — Wire `write_commit` / `read_commit`

write_commit dual-writes: msgpack to .muse/commits/ (old), git-object to .muse/objects/ (new)
read_commit falls back to .muse/objects/ via read_git_object when old path absent

Phase 4 — Wire `write_snapshot` / `read_snapshot`

Same pattern as Phase 3.

Phase 5 — Blobs

write_object uses write_git_object with type_str="blob"
read_object uses read_git_object and returns only the payload

Phase 6 — Tests

Rewrite tests/test_unified_object_store.py to use the new primitives throughout. Remove all references to the 2-byte header constants.

Phase 7 — Migration

Backfill all existing .muse/commits/ and .muse/snapshots/ msgpack objects into .muse/objects/ in the git-idiomatic format. Delete old directories once migration is verified.

Acceptance criteria

All objects (blobs, snapshots, commits) live in .muse/objects/sha256/<2>/<62>
On-disk format is "<type> <size>\0<payload>" — no binary header
Object ID is sha256("<type> <size>\0<payload>") for all types
Reading any object by ID yields its type without any separate metadata
All 8 tests in test_unified_object_store.py pass against the new format
No references to OBJECT_TYPE_* or OBJECT_FORMAT_V1 remain in the codebase
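Several of these criteria can be checked mechanically by walking the store. The sketch below is one way to do that; `check_object` and `check_store` are hypothetical names, and the set of known types is assumed from the three object kinds named in this issue.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path

KNOWN_TYPES = {"blob", "snapshot", "commit"}


def check_object(path: Path) -> str | None:
    """Return a problem description for one object file, or None if it is consistent."""
    raw = path.read_bytes()
    header, sep, payload = raw.partition(b"\x00")
    type_str, _, size = header.decode("ascii", errors="replace").partition(" ")
    if not sep or type_str not in KNOWN_TYPES:
        return "missing or unknown type prefix"
    if not size.isdigit() or int(size) != len(payload):
        return "declared size does not match payload"
    if hashlib.sha256(raw).hexdigest() != path.parent.name + path.name:
        return "path does not match sha256 of contents"
    return None


def check_store(repo: Path) -> dict[str, str]:
    """Map relative object path -> problem for every inconsistent file in the store."""
    root = repo / ".muse" / "objects" / "sha256"
    problems = {}
    for path in sorted(root.glob("*/*")):
        problem = check_object(path)
        if problem:
            problems[str(path.relative_to(root))] = problem
    return problems


# Demo: one well-formed object and one file that still carries a 2-byte binary header.
repo = Path(tempfile.mkdtemp())
framed = b"blob 5\x00hello"
oid = hashlib.sha256(framed).hexdigest()
good = repo / ".muse" / "objects" / "sha256" / oid[:2] / oid[2:]
good.parent.mkdir(parents=True)
good.write_bytes(framed)
bad = repo / ".muse" / "objects" / "sha256" / "ab" / ("c" * 62)
bad.parent.mkdir(parents=True, exist_ok=True)
bad.write_bytes(b"\x01\x01hello")
print(check_store(repo))  # reports only the old-format file
```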

Activity

gabriel opened this issue 64 days ago

gabriel 64 days ago

Remaining cleanup — load-bearing order

Migration ran successfully (9,706 blobs + 1,084 snapshots + 1,111 commits moved to unified object store). All seven phases are implemented. What remains is four shim-removal passes:

Pass 1 — delete `compute_commit_id` / `compute_snapshot_id` shims

File: muse/core/snapshot.py lines 339–511

Both are already one-liner wrappers that delegate to hash_commit / hash_snapshot in ids.py. The compute_commit_id docstring even says "Deprecated: use muse.core.ids.hash_commit instead". There are ~15 call sites across 13 files (commit.py, merge.py, rebase.py, cherry_pick.py, pull.py, shelf.py, revert.py, bridge.py, commit_tree.py, snapshot_cmd.py, merge_tree.py, mpack.py, core/rebase.py). Each just needs its import changed from snapshot to ids. Zero behavior change, zero risk.

Pass 2 — flip read priority in `read_commit` / `read_snapshot`

File: muse/core/store.py

Currently both functions check the legacy msgpack path first and fall back to the object store only when the file is absent. That priority should be reversed: object store first, msgpack as legacy fallback. This makes the object store the canonical read source. No data loss risk — both paths hold the same records.
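The priority flip amounts to reordering which source is consulted first. A small sketch with in-memory dicts as hypothetical stand-ins for the two storage layers; the records differ here only so the output shows which source answered, whereas in the real repo both hold the same data.

```python
from __future__ import annotations

from typing import Callable, Optional

Reader = Callable[[str], Optional[dict]]


def read_with_priority(object_id: str, readers: list[Reader]) -> dict | None:
    """Try each reader in order and return the first record found."""
    for reader in readers:
        record = reader(object_id)
        if record is not None:
            return record
    return None


# Hypothetical in-memory stand-ins for the two storage layers.
object_store = {"c1": {"message": "from object store"}}
legacy_msgpack = {"c1": {"message": "from msgpack"}, "c0": {"message": "legacy only"}}

before = [legacy_msgpack.get, object_store.get]  # current order
after = [object_store.get, legacy_msgpack.get]   # order after pass 2

print(read_with_priority("c1", before)["message"])  # from msgpack
print(read_with_priority("c1", after)["message"])   # from object store
print(read_with_priority("c0", after)["message"])   # legacy only
```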

Pass 3 — stop dual-writing msgpack in `write_commit` / `write_snapshot`

File: muse/core/store.py lines 1632–1636 and 2210–2214

After pass 2 flips read priority, new commits and snapshots no longer need to land in .muse/commits/ and .muse/snapshots/. Remove the _write_msgpack_atomic call from each writer; keep only the object store write. Pass 2 and pass 3 can land in the same commit.

Pass 4 — delete legacy msgpack dirs and remove fallback reads

Files: muse/core/store.py + all repos on disk

Once we have run on object-store-only writes for a while and are confident, remove the fallback read branches in read_commit (lines 1736–1746) and read_snapshot (lines 2233–2244), then delete .muse/commits/ and .muse/snapshots/ from every repo. At that point the unified object store is the only storage layer — the ticket is fully closed.

gabriel 64 days ago

All four passes complete and committed as sha256:5b3273e7.

Pass 1: deleted compute_commit_id, compute_snapshot_id, snapshot_identity_bytes, commit_identity_bytes from snapshot.py; all 15 call sites updated to import hash_commit / hash_snapshot directly from ids.py
Pass 2: flipped read priority in read_commit / read_snapshot — object store first, msgpack fallback second
Pass 3: write_commit and write_snapshot write exclusively to the object store; commit_exists checks the object store path
Pass 4: removed msgpack fallback read branches; deleted 3,308 legacy msgpack files from .muse/commits/ and .muse/snapshots/

The unified object store is now the sole storage layer for all commits, snapshots, and blobs in the muse repo.

Assignee

gabriel human
