Per-repo object store: eliminate three-store sync (flat file + musehub_objects + commit graph)
Problem
MuseHub currently maintains three separate stores that must stay perfectly in sync on every push:
- Flat file store: /data/musehub/<sha256_hex>, 34,699 files in a single directory, globally namespaced across all repos
- musehub_objects: DB mirror of the flat files (34,569 rows)
- musehub_commits / musehub_snapshots / musehub_branches: parsed commit graph in the DB
Push negotiation (wire_negotiate) queries musehub_commits via SQL to determine what the remote already has. The object store is written separately. Any operation that rewrites commit IDs (migration, force-resign, partial push failure) leaves the three stores inconsistent, and the push negotiation then makes incorrect decisions — either under-sending (the bug we hit today) or over-sending.
This is the root cause of a whole class of bugs. The immediate symptom: after muse code migrate --force-resign rewrote all musehub commit IDs, the remote's DB had old IDs and the flat file store had a mix. The push walk stopped at a commit whose objects happened to be in the flat store but whose record wasn't in musehub_commits. Force push couldn't fix it.
What git forges do: GitHub/GitLab store bare repos on disk (repo.git/ with standard pack files). The DB holds only user metadata, access control, issues, and a search index. The repo data itself is never in the DB — the forge's API reads the on-disk git object store directly. A push is just writing objects to disk and advancing a ref file. There is no DB-object sync to break.
Target architecture
/data/repos/<owner>/<slug>/
  objects/
    sha256/          ← algorithm namespace (mldsa65/ slots in here when PQ lands)
      ab/            ← 2-char hex shard
        <62-hex>     ← object blob (sha256 = 64 hex total; 2 consumed by shard dir)
  refs/
    heads/
      main           ← contains: sha256:<64-hex>
      dev
  HEAD               ← contains: ref: refs/heads/main
This mirrors .muse/objects/sha256/<shard>/<rest> exactly — the on-disk format is identical between the local client store and the server store.
DB tables that become caches only (rebuilt from disk, never canonical):
- musehub_commits: fast graph queries, search
- musehub_snapshots: fast manifest lookups
- musehub_branches: fast branch listing
DB tables that remain canonical (not derivable from objects):
- musehub_repos: repo metadata, visibility, owner
- musehub_identities / musehub_auth_keys: identity/auth
- musehub_issues / musehub_proposals / musehub_reviews: collaboration layer
- musehub_objects: reduced to just a storage_uri + size_bytes index for fetch path resolution
Implementation phases (load-bearing order)
Phase 1 — Per-repo directory isolation (unblocks everything else)
Goal: each repo gets its own object directory. Objects are no longer globally namespaced.
Changes:
LocalBackend._path(object_id) currently maps everything to /data/musehub/<safe_id>. Change to accept a repo_root parameter:
# Before
def _path(self, object_id: str) -> Path:
    return self._root / self._safe_id(object_id)

# After: Phase 1 (per-repo root, still flat; Phase 2 adds algo/shard)
def _path(self, object_id: str, repo_root: Path | None = None) -> Path:
    # Fall back to the legacy global root until every caller passes repo_root
    base = (repo_root / "objects") if repo_root is not None else self._root
    safe = self._safe_id(object_id)  # removed in Phase 2
    return base / safe
musehub_config.musehub_objects_dir becomes musehub_repos_dir (/data/repos)
Repo root = /data/repos/<owner>/<slug>/
All wire push/fetch/presign paths pass repo_root through to the backend
Migration:
# One-time job: for each object in musehub_objects:
# 1. Read old path from storage_uri
# 2. Compute new path: /data/repos/<owner>/<slug>/objects/<safe_id> (still flat here; Phase 2 reshards)
# 3. hardlink (same filesystem) or copy, then update storage_uri
# Note: one object may be referenced by multiple repos (shared blobs) —
# hardlinks are correct here; copies are safe but use more space.
Why first: all subsequent phases build on per-repo roots. Can't shard, can't add refs, can't make DB a cache until objects are per-repo.
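The hardlink-or-copy step of the migration could be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped migration script: the helper name `link_or_copy` and the `.migrating` temp suffix are hypothetical, and updating storage_uri afterwards is left to the caller.

```python
import errno
import os
import shutil
from pathlib import Path


def link_or_copy(old_path: Path, new_path: Path) -> str:
    """Place old_path's content at new_path; return 'exists', 'linked' or 'copied'."""
    new_path.parent.mkdir(parents=True, exist_ok=True)
    if new_path.exists():
        return "exists"  # idempotent: safe to re-run after a partial migration
    try:
        os.link(old_path, new_path)
        return "linked"
    except OSError as exc:
        # EXDEV: source and target are on different filesystems.
        # EPERM / EMLINK: hard links not permitted, or the link limit was hit.
        if exc.errno not in (errno.EXDEV, errno.EPERM, errno.EMLINK):
            raise
    # Copy to a temp name first so a crash never leaves a truncated object in place.
    tmp = new_path.with_name(new_path.name + ".migrating")
    shutil.copy2(old_path, tmp)
    os.replace(tmp, new_path)
    return "copied"
```

Returning the outcome lets a dry-run or verify pass report how many objects were linked versus copied, which is useful when a blob is shared by several repos.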
Phase 2 — Object sharding (algo-namespaced + 2-char prefix)
Goal: eliminate flat-directory inode hell AND add algorithm namespacing. 34K files in one dir is already slow on some filesystems; at 1M it becomes a hard limit. The algo level (sha256/, mldsa65/, …) mirrors the local .muse/objects/ layout exactly and future-proofs for post-quantum object IDs with zero layout changes.
Changes:
def _path(self, object_id: str, repo_root: Path) -> Path:
    # "sha256:abcdef0123..." → objects/sha256/ab/cdef0123...
    algo, hex_part = object_id.split(":", 1)
    return repo_root / "objects" / algo / hex_part[:2] / hex_part[2:]
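_safe_id can be deleted: once the ID is split on the colon, the algo/shard/rest components contain no reserved characters and the colon never appears on disk. This holds only if object IDs are validated as <algo>:<hex> before they reach _path; an unvalidated ID could still carry "/" or ".." into the path.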
No other DB changes — storage_uri already points to the resolved path
Migration: rename files to <algo>/<shard>/<rest> paths, update storage_uri
Why second: the shard layout is a prerequisite for pack file support in Phase 5. Doing it before Phase 3 means the cache-rebuild logic in Phase 3 never sees the flat layout.
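Since `_path` builds a filesystem path from a client-supplied ID, a validating variant may be worth considering once _safe_id is gone. The sketch below is an assumption about what such validation could look like: the name `sharded_path`, the regex, and the `_HEX_LEN` table are hypothetical, and only sha256 is listed because the hex length of other algorithms is not specified here.

```python
import re
from pathlib import Path

# Assumption: IDs are "<algo>:<lowercase hex>"; the length table is illustrative.
_HEX_LEN = {"sha256": 64}
_OBJECT_ID_RE = re.compile(r"([a-z0-9]+):([0-9a-f]+)")


def sharded_path(repo_root: Path, object_id: str) -> Path:
    """Map 'sha256:abcd...' to <repo_root>/objects/sha256/ab/cd..., rejecting bad IDs."""
    match = _OBJECT_ID_RE.fullmatch(object_id)
    if match is None:
        raise ValueError(f"malformed object ID: {object_id!r}")
    algo, hex_part = match.groups()
    if _HEX_LEN.get(algo) != len(hex_part):
        raise ValueError(f"unsupported algorithm or wrong length: {object_id!r}")
    return repo_root / "objects" / algo / hex_part[:2] / hex_part[2:]
```

Rejecting anything outside the expected alphabet means an ID such as `sha256:../../etc/passwd` fails before any path is constructed.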
Phase 3 — On-disk refs as canonical branch pointers
Goal: branch heads live in refs/heads/<name> files on disk. DB musehub_branches.head_commit_id becomes a cache column.
Changes:
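# On push: after objects are written, write refs/heads/<branch> atomically (rename-into-place)
ref_path = repo_root / "refs" / "heads" / branch_name
ref_path.parent.mkdir(parents=True, exist_ok=True)  # in case the branch name contains "/"
# Append ".tmp" instead of with_suffix(), which would replace ".0" in a name like "release-1.0"
tmp = ref_path.with_name(ref_path.name + ".tmp")
tmp.write_text(f"{new_head_commit_id}\n")
tmp.replace(ref_path)  # atomic rename on POSIX; overwrites an existing ref
# Then update musehub_branches in the DB (cache write, not authoritative)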
Add GET /repos/{owner}/{repo}/branches/{name}/repair — reads disk, heals DB if diverged
Startup health-check: compare DB branch heads vs disk refs; log divergence
Why third: once refs are on disk, the DB can be treated as a cache for the first time. Push negotiation can fall back to disk when the DB is stale (fixing our immediate bug class).
Phase 4 — Push negotiation reads disk, not DB
Goal: wire_negotiate reads the on-disk commit graph, eliminating DB-drift bugs.
Current:
# wire_negotiate: queries musehub_commits via SQL
ack_q = await session.execute(
    select(db.MusehubCommit.commit_id).where(
        db.MusehubCommit.commit_id.in_(have_set), ...))
Target: read commit objects directly from the per-repo object store:
async def _commit_exists_on_disk(repo_root: Path, commit_id: str) -> bool:
    path = object_path(repo_root, commit_id)
    return path.exists()

# wire_negotiate: no DB query for have/want negotiation
ack = [cid for cid in have_set if await _commit_exists_on_disk(repo_root, cid)]
DB musehub_commits is still written on push (for fast graph queries and API search)
But push negotiation never trusts it as the source of truth
Result: force-resign, migration, partial push — none can corrupt the negotiation
Why fourth: depends on per-repo roots (Phase 1) to know where to look. This phase directly fixes the class of bug that motivated this ticket.
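One practical detail for the disk-based check: Path.exists() is a blocking call, so a large have set could stall the event loop if checked inline. The sketch below batches the checks into a worker thread with asyncio.to_thread. The names `ack_haves` and `_present_on_disk` are hypothetical, and `object_path` here is a stand-in that assumes the Phase 2 layout. Note that file presence only shows the object is stored, not that it is a commit or that its ancestors are present.

```python
import asyncio
from pathlib import Path


def object_path(repo_root: Path, object_id: str) -> Path:
    """Stand-in for the Phase 2 layout: objects/<algo>/<2-hex shard>/<rest>."""
    algo, hex_part = object_id.split(":", 1)
    return repo_root / "objects" / algo / hex_part[:2] / hex_part[2:]


def _present_on_disk(repo_root: Path, commit_ids: list[str]) -> list[str]:
    # Blocking stat() calls; the caller runs this off the event loop.
    return [cid for cid in commit_ids if object_path(repo_root, cid).is_file()]


async def ack_haves(repo_root: Path, have_set: set[str]) -> list[str]:
    """Return the subset of have_set whose commit objects exist on disk."""
    return await asyncio.to_thread(_present_on_disk, repo_root, sorted(have_set))
```

Sorting the IDs first keeps the result deterministic, which makes negotiation traces easier to compare across runs.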
Phase 5 — Pack file support + GC
Goal: periodically pack loose objects into pack files (like git pack-objects). Reduces inode count from O(objects) to O(1) per pack.
Changes:
- Pack format: a sorted index file + a data file, mirroring the msgpack wire format already used in bundles
- muse maintenance run --pack on the server-side repo triggers packing
- Loose objects written by push; background job packs them (like git gc --auto)
- LocalBackend.get() checks pack files when loose object not found
- Pack files are immutable once written; GC deletes packs whose objects are all present in newer packs
Why fifth: purely a performance/scalability concern. It fixes no correctness bug, but it builds on the sharded layout from Phase 2.
Phase 6 — Storage tier formalisation
Goal: formalise the hot/warm/cold tiering that get_backend() already hints at.
Hot — local disk (per-repo objects, recently pushed, < 30 days)
Warm — S3/R2 (objects older than 30 days, large blobs > 10 MB)
Cold — Glacier / archival (objects > 1 year, unpopular repos)
Changes:
- StorageBackend protocol gains tier() -> Literal['hot','warm','cold']
- get_backend() returns a TieredBackend that falls through hot → warm → cold
- Background job promotes/demotes objects between tiers based on access time + age
- LocalBackend = hot; S3Backend = warm (already implemented); new GlacierBackend = cold
Why last: pure operational concern, no correctness impact. Can be done incrementally per-repo after the storage layout is stable.
Acceptance criteria
- Each repo has an isolated object directory under /data/repos/<owner>/<slug>/objects/
- Objects are algo-namespaced and sharded: objects/sha256/ab/<62-hex>
- Branch heads written atomically to refs/heads/<name> on disk on every push
- wire_negotiate does not query musehub_commits for have/want resolution
- Force-resign + re-push works without DB surgery
- GET /repos/{owner}/{repo}/branches/{name}/repair heals DB from disk
- Migration job moves all existing objects to per-repo sharded paths
- Pack file GC runs as a background maintenance job
- muse push local dev --force after a full force-resign succeeds in one command with no manual DB edits
Implemented across six phases. All phases shipped and live on staging.
- Phase 1 (per-repo isolation): sha256:af840c37d9e3f
- Phase 3 (on-disk refs): sha256:4d14406aff879
- Phase 4 (disk-based push negotiation): sha256:b68da9c205ac4
- Phase 5 (pack file + GC): sha256:699c1fcf16db4
- Phase 6 (storage tier formalisation): sha256:386278accc3dd
- Wire-up (repo_root threading through push/fetch): sha256:627a1cfe4efcd sha256:cde72cee233cf sha256:5edef517efe89
- Migration script (flat → per-repo, dry-run / migrate / verify / prune): sha256:f8d3be6b8476c
Architecture correction (algo-namespaced paths, _safe_id deletion) captured in the issue body.