gabriel / muse public
Closed #12 task
filed by gabriel human · 18 days ago

Remove msgpack from storage layer — commit to git-header+JSON and plain JSON

0 Anchors
Blast radius
Churn 30d
0 Proposals

Problem

The codebase is in an inconsistent serialization state that will cause silent bugs and maintainability debt:

  1. The naming lie: CommitRecord.from_msgpack(), SnapshotRecord.from_msgpack(), etc. — these methods accept plain Python dicts (from JSON parsing), not binary msgpack bytes. The name is wrong.

  2. The extension lie: .muse/commits/*.msgpack and .muse/snapshots/*.msgpack files exist on disk. The write_commit function writes "commit {len}\0{json}" — a git-object-style header with JSON body. NOT binary msgpack. But the files have a .msgpack extension.

  3. The regression: commit_path() now correctly returns _object_path().muse/objects/<2>/<62>. But the old .muse/commits/*.msgpack files still exist and are now unreachableread_commit() will never find them. Those commits are silently lost unless we migrate them.

  4. Real binary msgpack still lives in storage: tags, releases, shelf entries, stage index, caches, and code indices all still use _write_msgpack_atomic / _read_msgpack. These need to be converted.

  5. WireMPack is fine: The push/pull wire protocol uses binary msgpack intentionally. This stays. It is the only legitimate use of the msgpack Python library.

Goal

One clear rule for every byte in this system:

  • Wire transport: binary msgpack (WireMPack protocol) — stays
  • Content-addressed objects (commits, snapshots, blobs): git-object-style header + JSON — already correct, extend to tags/shelf
  • Named config-style files (tags, releases, stage index): plain JSON
  • Ephemeral caches + indices: plain JSON (stale cache = cache miss, safe to discard)
  • No import msgpack outside of transport/bundle/pack-objects: enforced by a lint check added in Phase 1

Phases

Phase 1 — Naming fix (zero functional change, TDD)

What: Rename from_msgpackfrom_dict on all four record classes. Fix every docstring that says ".msgpack" when it means the unified object store (which has no extension). Add a lint rule that fails CI if import msgpack appears outside muse/core/transport.py, muse/core/mpack.py, muse/cli/commands/bundle.py, muse/cli/commands/pack_objects.py, muse/cli/commands/unpack_objects.py.

Records to rename:

  • CommitRecord.from_msgpack()from_dict()
  • SnapshotRecord.from_msgpack()from_dict()
  • TagRecord.from_msgpack()from_dict()
  • ReleaseRecord.from_msgpack()from_dict()

Tests first:

  • test_record_from_dict_commit — CommitRecord.from_dict accepts the same input as from_msgpack did
  • test_record_from_dict_snapshot — same for SnapshotRecord
  • test_record_from_dict_tag — same for TagRecord
  • test_record_from_dict_release — same for ReleaseRecord
  • test_msgpack_import_lint — assert no import msgpack in storage files (allowlist enforced)

Phase 2 — Migrate legacy .muse/commits/ and .muse/snapshots/ (TDD)

What: These directories contain binary msgpack commits/snapshots from before the unified object store migration. They are currently unreadable because read_commit() looks in .muse/objects/ not .muse/commits/. This is a silent data loss regression.

Fix:

  • Extend muse code migrate (or add muse migrate legacy-store) to walk .muse/commits/<algo>/*.msgpack and .muse/snapshots/<algo>/*.msgpack
  • Deserialise each file with the OLD binary msgpack reader (we keep the reader temporarily)
  • Write each record via write_commit / write_snapshot to the unified object store
  • After migration: delete the legacy files and directories
  • Remove commits_dir and snapshots_dir from paths.py INIT_DIRS list (they must no longer be created by muse init)
  • Remove the legacy commits and snapshots entries from repo.py INIT_DIRS

Tests first:

  • test_legacy_commits_migrate — populate .muse/commits/ with real binary msgpack commits, run migrate, verify read_commit() finds them in .muse/objects/
  • test_legacy_snapshots_migrate — same for snapshots
  • test_legacy_dir_cleaned_after_migrate.muse/commits/ removed after migration
  • test_init_does_not_create_commits_dirmuse init no longer creates .muse/commits/
  • test_read_commit_never_touches_commits_dirread_commit returns None for IDs that only exist in the legacy dir (pre-migrate)

Phase 3 — Tags and releases: binary msgpack → plain JSON (TDD)

What: write_tag and write_release currently use _write_msgpack_atomic. Convert to plain JSON files. Tags live in .muse/tags/, releases in .muse/releases/ — named files, not content-addressed, so no git-object framing needed.

Fix:

  • Replace _write_msgpack_atomic(path, tag.to_dict()) with write_json_atomic(path, tag.to_dict())
  • Replace _read_msgpack_dict(path) with read_json_file(path) in read_tag / list_tags
  • Drop .msgpack file extension from tag and release paths
  • Add read-upgrade path: on first read, if the old .msgpack file exists and the new .json file does not, read the msgpack and write the JSON (silent migration)
  • After one release cycle, remove the msgpack read-upgrade path

Tests first:

  • test_tag_write_read_json — write tag, verify file is valid JSON, read it back
  • test_release_write_read_json — same for releases
  • test_tag_legacy_msgpack_upgrade — place an old .msgpack tag file, verify it's readable and auto-migrated
  • test_release_legacy_msgpack_upgrade — same for releases
  • test_list_tags_mixed_formats — directory has both old msgpack and new json files, list returns both

Phase 4 — Shelf entries: binary msgpack → git-header+JSON (TDD)

What: Shelf entries are content-addressed (.muse/shelf/<algo>/<hex>.msgpack). Convert to the same git-object-style framing as commits: shelf <size>\0<json>. Drop the .msgpack extension — shelf files in the unified layout have no extension.

Fix:

  • write_shelf_entry: write "shelf {len}\0{json}" bytes to .muse/shelf/<algo>/<hex> (no extension)
  • read_shelf_entry: read with the same header-strip + JSON logic as read_commit
  • list_shelf_entries: change glob from "*/*.msgpack" to "*/*" (filter by path depth, not extension)
  • shelf_entry_path: remove .msgpack suffix
  • Migration: read_shelf_entry falls back to the old .msgpack path on miss, reads msgpack, writes new format — silent one-shot upgrade per entry

Tests first:

  • test_shelf_write_read_new_format — save/pop round-trip with new format
  • test_shelf_entry_no_extension — verify file has no extension
  • test_shelf_legacy_msgpack_readable — old .msgpack entry is readable and auto-migrated
  • test_shelf_list_new_format — list_shelf_entries finds new-format files
  • test_shelf_gc_walks_new_paths — GC reachability walk finds entries at new path

Phase 5 — Stage index: binary msgpack → plain JSON (TDD)

What: muse/plugins/code/stage.py uses msgpack for the stage index. The stage index is a working-tree artifact — if unreadable it's safe to treat as empty (user re-stages). Convert to plain JSON.

Fix:

  • write_stage: write plain JSON
  • read_stage: read JSON; on JSONDecodeError or old msgpack magic bytes, return empty dict (silent reset)
  • Bump the stage schema version to invalidate all existing msgpack stage files

Tests first:

  • test_stage_write_read_json — add file to stage, verify JSON on disk, read back
  • test_stage_stale_msgpack_treated_as_empty — place old msgpack stage file, verify read_stage returns {}
  • test_stage_add_after_stale_reset — after stale reset, new adds work correctly

Phase 6 — Caches and indices: binary msgpack → plain JSON (TDD)

What: All cache and index files under .muse/cache/ and .muse/indices/ use binary msgpack. These are all ephemeral and rebuildable — a stale/unreadable cache is always a cache miss.

Files: cache_base.py, stat_cache.py, callgraph_cache.py, symbol_cache.py, implicit_edge_cache.py, test_history.py, doc_history.py, patch_record.py, graph.py, indices.py

Fix:

  • Replace msgpack.packbjson.dumps(...).encode()
  • Replace msgpack.unpackbjson.loads(...)
  • On read failure or format mismatch: return cache miss (empty/None) — same as today on corrupt msgpack
  • Bump cache schema versions to invalidate old files

Tests first (one representative test per subsystem):

  • test_stat_cache_json_roundtrip
  • test_callgraph_cache_json_roundtrip
  • test_symbol_cache_json_roundtrip
  • test_indices_json_roundtrip
  • test_cache_stale_msgpack_is_miss — old msgpack bytes → cache miss, no crash

Phase 7 — Remove msgpack from storage, document wire-only contract (TDD)

What: After phases 1–6, verify that the msgpack Python library is only imported in wire-protocol files. Add this as a hard lint test. Update documentation.

Files that keep import msgpack (wire-only allowlist):

  • muse/core/transport.py
  • muse/core/mpack.py
  • muse/cli/commands/bundle.py
  • muse/cli/commands/pack_objects.py
  • muse/cli/commands/unpack_objects.py
  • muse/cli/commands/verify_pack.py

Tests first:

  • test_no_msgpack_import_in_storage — parse AST of every .py file under muse/core/ and muse/plugins/; assert none import msgpack
  • test_no_msgpack_import_in_cli_storage — same for muse/cli/commands/ minus the allowlist
  • test_wire_allowlist_files_exist — every file in the allowlist exists (guard against stale allowlist)

Acceptance Criteria

  • All four record classes use from_dict() — no from_msgpack method exists
  • .muse/commits/ and .muse/snapshots/ directories are not created by muse init
  • muse code migrate (or muse migrate legacy-store) successfully migrates repos that have legacy msgpack commits/snapshots
  • Tags and releases are stored as plain JSON files (no .msgpack extension)
  • Shelf entries use git-header+JSON, no .msgpack extension
  • Stage index is plain JSON
  • All caches and indices are plain JSON
  • import msgpack appears ONLY in wire-protocol files (enforced by test)
  • All phase tests pass, typing ratchet stays at 0 any / 0 untyped
  • muse verify passes on repos migrated from the old format

Out of scope

  • Changing the WireMPack wire format — the push/pull protocol stays as binary msgpack
  • Changing the git-object-style framing for commits/snapshots in the object store — that's already correct
  • Any changes to the musehub server — this is a client-side storage migration only
Activity9
gabriel opened this issue 18 days ago
gabriel 18 days ago

Phase 1 complete — commit sha256:2d04edd1c8bae30686ae49061151b3f3ddb98d48136d76cde3af696a03e9b33e landed on dev and main.

Changes:

  • CommitRecord, SnapshotRecord, TagRecord, ReleaseRecord: merged from_msgpack + from_dict into a single defensive from_dict, deleted from_msgpack
  • SymbolHistoryEntry: same merge
  • muse/core/test_history.py: _record_from_msgpack → _record_from_dict, _record_to_msgpack → _record_to_dict
  • All 9 source call sites and all test files updated
  • Added tests/test_phase1_naming.py: 21 TDD tests enforcing no from_msgpack remains anywhere

167/167 tests pass. Typing ratchet at zero.

gabriel 18 days ago

Phase 2 — current state audit (pre-implementation)

Data integrity check on the live muse repo

Ran a full inspection of .muse/commits/sha256/ and .muse/snapshots/sha256/ against the object store:

  • 14 commits in .muse/commits/sha256/*.msgpack — 10 already in object store, 4 missing
  • 1 snapshot in .muse/snapshots/sha256/*.msgpack — already in object store

The 4 commits missing from the object store are not reachable from any branch head — walked all 1,096 ancestors from HEAD and none appear. They are genuine orphans from superseded work, not live history. No active data loss.


What the existing migrate does and doesn't do

Works:

  • Phase 7 (migrate_commit_ids + migrate_snapshot_ids) correctly reads legacy msgpack files, rewrites them to .muse/objects/sha256/<2>/<62> in commit/snapshot <size>\0<json> format, and updates refs. This ran yesterday — 10 of 14 commits landed in the object store.

Incomplete (Phase 2 gaps):

  1. No cleanup after Phase 7.muse/commits/ and .muse/snapshots/ are never deleted after migration. Legacy files persist as orphan garbage.
  2. Main migrate loop still writes to the wrong place_write_raw_commit() writes canonical commits back to .muse/commits/sha256/<hex>.msgpack. Phase 7 then reads those back and copies them to the object store. It works but is redundant and repopulates the legacy dir on every migrate run.
  3. init_repo_dirs() still creates legacy dirspaths.py init_repo_dirs() creates commits_dir and snapshots_dir on every muse init. New repos start with empty legacy dirs that should not exist.
  4. No tests for cleanup behavior — no assertion that the legacy dirs are removed after migrate, or that muse init doesn't create them.

Plan for Phase 2 implementation

  • Write TDD tests first (as specified in the issue)
  • migrate should rescue the 4 orphaned commits into the object store before deleting — never silently discard reachable-or-not content
  • After writing all legacy content to the object store, delete .muse/commits/ and .muse/snapshots/ entirely
  • Remove commits_dir and snapshots_dir from init_repo_dirs() in paths.py
  • The 4 orphan commits that end up in the object store but unreferenced by any branch will be naturally collected by muse gc — no special handling needed; that's exactly what gc is for
gabriel 17 days ago

Phase 2 complete.

What was done:

  • migrate() now deletes .muse/commits/ and .muse/snapshots/ after all content is written to the unified object store (dry_run=True preserves them)
  • init_repo_dirs() no longer creates commits_dir or snapshots_dirmuse init is now legacy-dir-free
  • All legacy commits/snapshots are rescued into the object store before the directories are removed; orphaned objects unreferenced by any branch will be collected by muse gc
  • 12 TDD tests added in tests/test_phase2_legacy_store_migration.py covering: pre-migrate read returns None, post-migrate readability, field preservation, chain migration, snapshot migration, dir cleanup, dry_run preservation, and init guard
  • tests/test_code_migrate.py updated: removed dead _read_raw_commit duplicate, fixed helper to read from object store (with legacy fallback for dry-run tests), updated 4 path assertions to check the object store instead of the deleted legacy directories

How other musers migrate their repos: Run muse code migrate — it is now a one-stop shop. Migrates blob IDs, snapshot IDs, commit IDs, and cleans up the legacy dirs in one pass. The --dry-run flag previews without touching anything. After migrate completes, muse gc will collect any unreachable orphans.

gabriel 17 days ago

Phase 2 — complete ✓

Commit: sha256:2c1ca28035d516023ea39067cd17c86e38631cd982a6ec9aee235870e015d373

Changes landed on both dev and main:

  • muse/core/migrate.py — added shutil.rmtree cleanup at end of migrate() for both .muse/commits/ and .muse/snapshots/ (guarded by dry_run)
  • muse/core/paths.py — removed commits_dir and snapshots_dir from init_repo_dirs(); muse init no longer creates legacy directories
  • tests/test_phase2_legacy_store_migration.py — 12 new TDD tests (all green)
  • tests/test_code_migrate.py — updated helpers to read from the unified object store; fixed 5 path assertions that referenced the now-deleted legacy directories

Migration path for other musers: muse code migrate is the one-stop shop — rescues legacy binary msgpack commits/snapshots into the object store, then deletes the legacy dirs. --dry-run to preview. muse gc cleans up any unreachable orphans afterward.

gabriel 17 days ago

Phase 3 — complete ✓

Commit: sha256:13593414ec8a77e9d7143c16a14a15c0dccf52dd4b9e8f450119d09a1e88f942

Changes landed on both dev and main:

  • muse/core/store.py — added _write_json_atomic() helper (same mkstemp+fsync+rename crash safety as _write_msgpack_atomic); tag_path() and release_path() now return .json paths; write_tag() and write_release() write plain JSON; all readers (get_all_tags, get_tags_for_commit, list_releases, read_release) glob */* (both extensions), silently migrate .msgpack.json on first read and delete the old file
  • muse/core/gc.py_collect_reachable_commits() updated to read .json tag files (with .msgpack fallback for legacy)
  • tests/test_phase3_tags_releases_json.py — 20 new TDD tests (all green)

No regressions: 524 existing tag/release tests pass.

gabriel 17 days ago

Phase 4 complete — shelf entries: git-header+JSON, no extension

Commit: sha256:5cb3eb55c963

What changed

  • shelf_entry_path() now returns .muse/shelf/<algo>/<hex> with no file extension (was .msgpack)
  • write_shelf_entry() writes shelf <size>\0<json> framing — same header pattern as commits/snapshots
  • read_shelf_entry() reads new format; silently migrates legacy .msgpack entries on first read (writes new format, deletes old file)
  • list_shelf_entries() handles both formats in mixed directories
  • delete_shelf_entry() removes both new-format and legacy .msgpack files
  • gc._collect_shelf_objects() reads object IDs from new-format shelf entries

Tests

  • 18 new tests in tests/test_phase4_shelf_json.py covering all Phase 4 requirements
  • 11 updated tests in tests/test_shelf_msgpack_storage.py (extension assertions, glob patterns)
  • 1 updated test in tests/test_cmd_shelf.py (fsync check now targets _write_shelf_header_atomic)
  • All 223 shelf tests pass; Phase 3 and Phase 4 test suites (38 tests) fully green

Remaining phases

  • Phase 5: stage index → JSON
  • Phase 6: caches/indices → JSON
  • Phase 7: remove msgpack dependency entirely
gabriel 17 days ago

Phase 5 complete — stage index: binary msgpack → plain JSON

Commit: sha256:3ffc2032aee1

What changed

  • code_stage_path() now returns .muse/code/stage.json (was stage.msgpack)
  • write_stage() writes UTF-8 JSON via atomic fsync+rename — same crash-safety guarantees as before
  • read_stage() parses JSON; on old binary msgpack content (first byte > 0x7F), logs a warning and returns {} — safe silent reset
  • Schema version bumped to 3 — all existing stage.msgpack files are silently discarded on next read
  • Removed the obsolete stage.json → stage.msgpack migration block that is now a no-op

Tests

  • 16 new tests in tests/test_phase5_stage_index_json.py
  • 1 updated test in tests/test_code_stage.py (migration concept reversed — JSON is now primary)
  • 85 stage tests pass

Remaining phases

  • Phase 6: caches and indices under .muse/cache/ and .muse/indices/ → JSON
  • Phase 7: remove import msgpack from all non-wire files; enforce via lint
gabriel 17 days ago

Phase 7 complete — msgpack confined to wire-protocol files, lint enforced

Commit: sha256:de8316d5608c

Changes

Source fix:

  • muse/cli/commands/sign.py — removed lazy import msgpack from HTTP response decoder. sign request responses are always JSON from the hub; raw-text fallback is sufficient.

New test file: tests/test_phase7_no_msgpack_in_storage.py — 4 AST-based lint tests:

  • test_all_allowlist_files_exist — every file in the wire allowlist exists on disk (guard against stale entries)
  • test_no_msgpack_import_in_core_storage — no muse/core/ file outside the allowlist imports msgpack
  • test_no_msgpack_import_in_plugins — no muse/plugins/ file imports msgpack (fully clean)
  • test_no_msgpack_import_in_cli_storage — no muse/cli/commands/ file outside the allowlist imports msgpack

Wire-only allowlist (9 files):

muse/core/store.py         object store binary format (commits, snapshots, tags)
muse/core/gc.py            reads legacy msgpack object store files
muse/core/transport.py     wire transport protocol (WireMPack)
muse/core/migrate.py       legacy msgpack→JSON migration tooling
muse/core/mpack.py         MPack assembly/apply for wire transfer
muse/cli/commands/bundle.py
muse/cli/commands/pack_objects.py
muse/cli/commands/unpack_objects.py
muse/cli/commands/verify_pack.py

Issue #12 complete — all 7 phases done

Phase What Status
1 Naming / extensions — tags, releases, refs to .json
2 Legacy store migration (msgpack commits/snapshots → git-header+JSON)
3 Tags and releases to plain JSON
4 Shelf entries to git-header+JSON
5 Stage index to plain JSON
6 All caches and indices to plain JSON
7 msgpack confined to wire-protocol files, lint-enforced

All acceptance criteria met. Closing.

gabriel 17 days ago

Issue #12 Audit — Complete

Full sweep of all source and test files for post-migration correctness. Changes made in 7 commits on this session:

Bugs found and fixed

** in two test files** — wrote msgpack bytes to a .json path. _read_tag_or_migrate swallowed the JSONDecodeError and returned None, so tests asserting result["valid"] is False passed vacuously ("tag not found" instead of actual signature verification failure). Fixed in tests/test_cmd_verify_tag.py and tests/test_verify_tag_supercharge.py — both now write valid JSON so signature verification actually runs.

Vacuously passing shelf/commit tests — several helpers wrote legacy msgpack blobs to old paths that the new readers never touch. Fixed in tests/test_object_store_write_taxonomy.py, tests/test_core_gc.py.

Wrong symlink testtests/test_security_symlink.py was testing the deleted _write_msgpack_atomic. Rewrote to cover the active write_shelf_entry path, including the correct error-message match pattern ("symlink" not "symbolic link").

Dead code deleted

_write_msgpack_atomic in muse/core/store.py — no production callers. Deleted (~62 lines). _validated_store_parents kept (still used at lines 673, 675, 707, 709).

Documentation / comments corrected

  • store.py module docstring and layout diagram: .msgpack.json for tags and releases; added legacy migration note
  • gc.py --full docstring: "Orphaned commit records (.muse/commits/*.msgpack)" → "Orphaned commit objects (.muse/objects/)"
  • tests/test_snapshot_schema_version_and_compression.py: "zstd(msgpack(data))" → "zstd(json(data))"
  • tests/test_integrity_I3_concurrent_race.py, test_integrity_I9_sigkill.py: removed _write_msgpack_atomic references from docstrings
  • tests/test_push_have_filter.py: module docstring and inline comments updated to reflect object store

Unused imports removed

import msgpack removed from: test_gc_corrupt_commit_object_retention.py, test_cmd_remaining.py, test_code_stage.py, test_indices.py, test_pull_missing_snapshot_guard.py, test_security_object_store_poisoning.py.

Test rewrites

  • test_snapshot_schema_version_and_compression.py: removed msgpack roundtrips, tests now use JSON
  • test_core_test_history.py: msgpack roundtrips replaced with JSON roundtrips
  • test_snapshot_delta.py: size-measurement payload uses JSON bytes
  • test_store_fsync_enospc.py: deleted TestWriteMsgpackAtomicFsync and TestWriteMsgpackAtomicDarwin classes; renumbered remaining sections 2–7
  • test_bridge_git_import.py: dry-run assertions now check object store empty rather than *.msgpack glob

Post-cleanup state

muse gc collected 2,288 orphaned objects (136 MB). One corrupt snapshot (sha256:c594b73...) persists — unreferenced by any of the 1,075 commits in history but considered reachable by GC (likely held by a shelf entry or coord record). Pre-existing; no action needed for this issue.

Wire-format msgpack allowlist (9 files) is unchanged and correct.