Closed #1

filed by gabriel human · 48 days ago

Per-repo object store: eliminate three-store sync (flat file + musehub_objects + commit graph)

Problem

MuseHub currently maintains three separate stores that must stay perfectly in sync on every push:

Flat file store — /data/musehub/<sha256_hex> — 34,699 files in a single directory, globally namespaced across all repos
musehub_objects — DB mirror of the flat files (34,569 rows)
musehub_commits / musehub_snapshots / musehub_branches — parsed commit graph in the DB

Push negotiation (wire_negotiate) queries musehub_commits via SQL to determine what the remote already has. The object store is written separately. Any operation that rewrites commit IDs (migration, force-resign, partial push failure) leaves the three stores inconsistent, and the push negotiation then makes incorrect decisions — either under-sending (the bug we hit today) or over-sending.

This is the root cause of a whole class of bugs. The immediate symptom: after muse code migrate --force-resign rewrote all musehub commit IDs, the remote's DB had old IDs and the flat file store had a mix. The push walk stopped at a commit whose objects happened to be in the flat store but whose record wasn't in musehub_commits. Force push couldn't fix it.

What git forges do: GitHub/GitLab store bare repos on disk (repo.git/ with standard pack files). The DB holds only user metadata, access control, issues, and a search index. The repo data itself is never in the DB — the forge's API reads the on-disk git object store directly. A push is just writing objects to disk and advancing a ref file. There is no DB-object sync to break.

Target architecture

/data/repos/<owner>/<slug>/
  objects/
    sha256/                 ← algorithm namespace (mldsa65/ slots in here when PQ lands)
      ab/                   ← 2-char hex shard
        <62-hex>            ← object blob  (sha256 = 64 hex total; 2 consumed by shard dir)
  refs/
    heads/
      main                  ← contains: sha256:<64-hex>
      dev
  HEAD                      ← contains: ref: refs/heads/main

This mirrors .muse/objects/sha256/<shard>/<rest> exactly — the on-disk format is identical between the local client store and the server store.
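A minimal sketch of the layout above, assuming a fresh repo only needs the two directory trees and a HEAD file. The helper names `init_repo_layout` and `resolve_head` are hypothetical and are not part of MuseHub.

```python
from pathlib import Path
from typing import Optional


def init_repo_layout(repo_root: Path, default_branch: str = "main") -> None:
    """Create the empty skeleton for one repo (hypothetical helper)."""
    (repo_root / "objects" / "sha256").mkdir(parents=True, exist_ok=True)
    (repo_root / "refs" / "heads").mkdir(parents=True, exist_ok=True)
    head = repo_root / "HEAD"
    if not head.exists():
        # Same content as in the layout diagram: a symbolic ref to a branch.
        head.write_text(f"ref: refs/heads/{default_branch}\n")


def resolve_head(repo_root: Path) -> Optional[str]:
    """Follow HEAD to its branch file; None means the branch has no commit yet."""
    target = (repo_root / "HEAD").read_text().strip()
    if not target.startswith("ref: "):
        raise ValueError(f"unexpected HEAD content: {target!r}")
    ref_file = repo_root / target[len("ref: "):]
    return ref_file.read_text().strip() if ref_file.is_file() else None
```

Resolving HEAD needs only two file reads and no DB query, which is the property the rest of this plan relies on.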

DB tables that become caches only (rebuilt from disk, never canonical):

musehub_commits — fast graph queries, search
musehub_snapshots — fast manifest lookups
musehub_branches — fast branch listing

DB tables that remain canonical (not derivable from objects):

musehub_repos — repo metadata, visibility, owner
musehub_identities / musehub_auth_keys — identity/auth
musehub_issues / musehub_proposals / musehub_reviews — collaboration layer
musehub_objects — reduced to just storage_uri + size_bytes index for fetch path resolution

Implementation phases (load-bearing order)

Phase 1 — Per-repo directory isolation (unblocks everything else)

Goal: each repo gets its own object directory. Objects are no longer globally namespaced.

Changes:

LocalBackend._path(object_id) currently maps everything to /data/musehub/<safe_id>. Change to accept a repo_root parameter:

# Before
def _path(self, object_id: str) -> Path:
    return self._root / self._safe_id(object_id)

# After — Phase 1 (per-repo root, still flat; Phase 2 adds algo/shard)
def _path(self, object_id: str, repo_root: Path | None = None) -> Path:
    # No repo_root given: fall back to the legacy global root.
    base = repo_root / "objects" if repo_root is not None else self._root
    safe = self._safe_id(object_id)   # removed in Phase 2
    return base / safe

musehub_config.musehub_objects_dir becomes musehub_repos_dir (/data/repos)
Repo root = /data/repos/<owner>/<slug>/
All wire push/fetch/presign paths pass repo_root through to the backend

Migration:

# One-time job: for each object in musehub_objects:
# 1. Read old path from storage_uri
# 2. Compute new path: /data/repos/<owner>/<slug>/objects/<safe_id>  (still flat here; Phase 2 reshards)
# 3. hardlink (same filesystem) or copy, then update storage_uri
# Note: one object may be referenced by multiple repos (shared blobs) —
# hardlinks are correct here; copies are safe but use more space.
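Step 3 of the job above could look roughly like this for a single object. This is a sketch under assumptions: `place_object` is a hypothetical name, the `storage_uri` update is left to the caller, and any failed hardlink (for example across filesystems) falls back to a copy.

```python
import os
import shutil
from pathlib import Path


def place_object(old_path: Path, new_path: Path) -> str:
    """Hardlink old_path to new_path, falling back to a copy.

    Returns "exists", "linked", or "copied". Hypothetical helper.
    """
    if new_path.exists():
        return "exists"  # idempotent, so the job can be re-run safely
    new_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(old_path, new_path)  # same inode, no extra disk space
        return "linked"
    except OSError:
        # Linking failed, e.g. the paths are on different filesystems.
        shutil.copy2(old_path, new_path)
        return "copied"
```

Because the old file is never modified or removed here, a blob shared by several repos can be placed into each repo in turn; removing the flat originals would be a separate, later step.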

Why first: all subsequent phases build on per-repo roots. Can't shard, can't add refs, can't make DB a cache until objects are per-repo.

Phase 2 — Object sharding (algo-namespaced + 2-char prefix)

Goal: eliminate flat-directory inode hell AND add algorithm namespacing. 34K files in one dir is already slow on some filesystems; at 1M it becomes a hard limit. The algo level (sha256/, mldsa65/, …) mirrors the local .muse/objects/ layout exactly and future-proofs for post-quantum object IDs with zero layout changes.

Changes:

def _path(self, object_id: str, repo_root: Path) -> Path:
    # "sha256:abcdef0123..." → objects/sha256/ab/cdef0123...
    algo, hex_part = object_id.split(":", 1)
    return repo_root / "objects" / algo / hex_part[:2] / hex_part[2:]

_safe_id can be deleted — the algo/shard/rest structure never produces paths with reserved characters; the colon never appears on disk

No other DB changes — storage_uri already points to the resolved path

Migration: rename files to <algo>/<shard>/<rest> paths, update storage_uri
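Since the object ID now becomes path components directly, it may be worth validating it before building a path. The sketch below assumes IDs are always `<algo>:<lowercase hex>`; the regex, `shard_path`, and `reshard` are illustrative names, not existing MuseHub code.

```python
import os
import re
from pathlib import Path

# Assumption: "<algo>:<hex>" with a lowercase alphanumeric algo name and at
# least 3 hex chars (2 for the shard directory, 1 or more for the file name).
_OBJECT_ID = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{3,})")


def shard_path(repo_root: Path, object_id: str) -> Path:
    """Map "sha256:abcdef..." to objects/sha256/ab/cdef..., rejecting bad IDs."""
    m = _OBJECT_ID.fullmatch(object_id)
    if m is None:
        raise ValueError(f"malformed object id: {object_id!r}")
    hex_part = m.group("hex")
    return repo_root / "objects" / m.group("algo") / hex_part[:2] / hex_part[2:]


def reshard(flat_path: Path, repo_root: Path, object_id: str) -> Path:
    """Move one flat Phase 1 file to its sharded location; returns the new path."""
    dest = shard_path(repo_root, object_id)
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.replace(flat_path, dest)  # rename within the same filesystem
    return dest
```

The strict pattern means an ID such as `sha256:../../etc/passwd` raises instead of escaping the repo root, which matters once `_safe_id` is deleted.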

Why second: the shard layout is a prerequisite for pack file support in Phase 5. Doing it before Phase 3 means the cache-rebuild logic in Phase 3 never sees the flat layout.

Phase 3 — On-disk refs as canonical branch pointers

Goal: branch heads live in refs/heads/<name> files on disk. DB musehub_branches.head_commit_id becomes a cache column.

Changes:

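# On push: after objects are written, write refs/heads/<branch> atomically (rename-into-place)
ref_path = repo_root / "refs" / "heads" / branch_name
ref_path.parent.mkdir(parents=True, exist_ok=True)   # branch names may contain "/"
# Append ".tmp" instead of with_suffix(): with_suffix would turn "v1.0" into "v1.tmp"
tmp = ref_path.with_name(ref_path.name + ".tmp")
tmp.write_text(f"{new_head_commit_id}\n")
tmp.replace(ref_path)   # atomic rename on POSIX, overwrites the old ref
# Then update musehub_branches in the DB (cache write, not authoritative)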

Add GET /repos/{owner}/{repo}/branches/{name}/repair: reads disk, heals DB if diverged
Startup health-check: compare DB branch heads vs disk refs; log divergence
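The comparison behind both the repair endpoint and the startup health-check might be sketched as below. It assumes the DB heads have already been loaded into a plain dict and that files ending in `.tmp` are in-flight ref writes; `read_disk_refs` and `diverged_branches` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def read_disk_refs(repo_root: Path) -> Dict[str, str]:
    """Return {branch_name: head_commit_id} from refs/heads on disk."""
    heads = repo_root / "refs" / "heads"
    refs: Dict[str, str] = {}
    if not heads.is_dir():
        return refs
    for path in heads.rglob("*"):
        # Skip directories and temp files left by an interrupted ref write.
        if path.is_file() and not path.name.endswith(".tmp"):
            refs[path.relative_to(heads).as_posix()] = path.read_text().strip()
    return refs


def diverged_branches(
    disk: Dict[str, str], db: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Branches where disk and DB disagree, as {name: (disk_head, db_head)}.

    None on either side means the branch is missing from that store.
    """
    out: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for name in sorted(set(disk) | set(db)):
        if disk.get(name) != db.get(name):
            out[name] = (disk.get(name), db.get(name))
    return out
```

A repair would then overwrite each DB head with the disk value, since disk is canonical; the health-check would only log the returned dict.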

Why third: once refs are on disk, the DB can be treated as a cache for the first time. Push negotiation can fall back to disk when the DB is stale (fixing our immediate bug class).

Phase 4 — Push negotiation reads disk, not DB

Goal: wire_negotiate reads the on-disk commit graph, eliminating DB-drift bugs.

Current:

# wire_negotiate — queries musehub_commits via SQL
ack_q = await session.execute(
    select(db.MusehubCommit.commit_id).where(
        db.MusehubCommit.commit_id.in_(have_set), ...))

Target: read commit objects directly from the per-repo object store:

async def _commit_exists_on_disk(repo_root: Path, commit_id: str) -> bool:
    # object_path is assumed to be the Phase 2 mapping: objects/<algo>/<shard>/<rest>
    path = object_path(repo_root, commit_id)
    return path.exists()

# wire_negotiate — no DB query for have/want negotiation
ack = [cid for cid in have_set if await _commit_exists_on_disk(repo_root, cid)]

DB musehub_commits is still written on push (for fast graph queries and API search)
But push negotiation never trusts it as the source of truth

Result: force-resign, migration, partial push — none can corrupt the negotiation
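A self-contained version of the disk-based acknowledgement step, as a sketch only. It assumes the Phase 2 path mapping and checks loose objects only; `negotiate_ack` is a hypothetical name, and the `stat()` calls are moved to a worker thread so they do not block the event loop.

```python
import asyncio
from pathlib import Path
from typing import Iterable, List


def object_path(repo_root: Path, object_id: str) -> Path:
    """Phase 2 mapping: "sha256:abcd..." -> objects/sha256/ab/cd..."""
    algo, hex_part = object_id.split(":", 1)
    return repo_root / "objects" / algo / hex_part[:2] / hex_part[2:]


async def negotiate_ack(repo_root: Path, have: Iterable[str]) -> List[str]:
    """Return the IDs from `have` that exist as loose objects on disk."""

    def _check(ids: List[str]) -> List[str]:
        return [cid for cid in ids if object_path(repo_root, cid).is_file()]

    # Sorted for a deterministic result; the filesystem checks run off-loop.
    return await asyncio.to_thread(_check, sorted(have))
```

Note that file existence alone does not prove the object is a commit, and once Phase 5 lands the check would also have to consult pack files.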

Why fourth: depends on per-repo roots (Phase 1) to know where to look. This phase directly fixes the class of bug that motivated this ticket.

Phase 5 — Pack file support + GC

Goal: periodically pack loose objects into pack files (like git pack-objects). Reduces inode count from O(objects) to O(1) per pack.

Changes:

Pack format: a sorted index file + a data file, mirroring the msgpack wire format already used in bundles
muse maintenance run --pack on the server-side repo triggers packing
Loose objects written by push; background job packs them (like git gc --auto)
LocalBackend.get() checks pack files when loose object not found
Pack files are immutable once written; GC deletes packs whose objects are all present in newer packs
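To illustrate the "sorted index + data file" idea and the GC rule, here is a toy in-memory model. The index layout (a sorted list of `(object_id, offset, length)` tuples) is an assumption made for this sketch and is not the actual pack or msgpack format; all function names are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Set, Tuple

IndexEntry = Tuple[str, int, int]  # (object_id, offset, length)


def build_pack(objects: Dict[str, bytes]) -> Tuple[List[IndexEntry], bytes]:
    """Concatenate blobs in ID order and record where each one lives."""
    index: List[IndexEntry] = []
    data = bytearray()
    for oid in sorted(objects):
        blob = objects[oid]
        index.append((oid, len(data), len(blob)))
        data += blob
    return index, bytes(data)


def pack_get(index: List[IndexEntry], data: bytes, object_id: str) -> Optional[bytes]:
    """Binary-search the sorted index; None if the object is not in this pack."""
    i = bisect.bisect_left(index, (object_id,))
    if i < len(index) and index[i][0] == object_id:
        _, offset, length = index[i]
        return data[offset:offset + length]
    return None


def pack_is_redundant(pack_ids: Set[str], newer_packs: List[Set[str]]) -> bool:
    """GC rule from the list above: every object also lives in a newer pack."""
    covered: Set[str] = set().union(*newer_packs)
    return pack_ids <= covered
```

In this model a lookup costs one binary search plus one slice, and a pack can be deleted only when `pack_is_redundant` holds, so no object ever becomes unreachable through GC.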

Why fifth: purely a performance/scalability concern. It fixes no correctness bug, but it does require the sharded layout from Phase 2.

Phase 6 — Storage tier formalisation

Goal: formalise the hot/warm/cold tiering that get_backend() already hints at.

Hot  — local disk (per-repo objects, recently pushed, < 30 days)
Warm — S3/R2 (objects older than 30 days, large blobs > 10 MB)
Cold — Glacier / archival (objects > 1 year, unpopular repos)

Changes:

StorageBackend protocol gains tier() -> Literal['hot','warm','cold']
get_backend() returns a TieredBackend that falls through hot → warm → cold
Background job promotes/demotes objects between tiers based on access time + age
LocalBackend = hot; S3Backend = warm (already implemented); new GlacierBackend = cold
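The fall-through behaviour could be sketched as follows. The `get()` signature on `StorageBackend` is an assumption (the real one may be async or return something other than bytes), and `DictBackend` is an in-memory stand-in for the real backends so the example runs on its own.

```python
from typing import Dict, List, Literal, Optional, Protocol

Tier = Literal["hot", "warm", "cold"]
_ORDER = {"hot": 0, "warm": 1, "cold": 2}


class StorageBackend(Protocol):
    def tier(self) -> Tier: ...
    def get(self, object_id: str) -> Optional[bytes]: ...  # assumed signature


class DictBackend:
    """In-memory stand-in for LocalBackend / S3Backend / GlacierBackend."""

    def __init__(self, tier: Tier, objects: Dict[str, bytes]) -> None:
        self._tier = tier
        self._objects = objects

    def tier(self) -> Tier:
        return self._tier

    def get(self, object_id: str) -> Optional[bytes]:
        return self._objects.get(object_id)


class TieredBackend:
    """Try hot first, then warm, then cold; the first hit wins."""

    def __init__(self, backends: List[StorageBackend]) -> None:
        self._backends = sorted(backends, key=lambda b: _ORDER[b.tier()])

    def locate(self, object_id: str) -> Optional[Tier]:
        for backend in self._backends:
            if backend.get(object_id) is not None:
                return backend.tier()
        return None

    def get(self, object_id: str) -> Optional[bytes]:
        for backend in self._backends:
            blob = backend.get(object_id)
            if blob is not None:
                return blob
        return None
```

Sorting by `tier()` in the constructor means callers can pass backends in any order, and the promotion/demotion job could use `locate()` to find where an object currently lives.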

Why last: pure operational concern, no correctness impact. Can be done incrementally per-repo after the storage layout is stable.

Acceptance criteria

Each repo has an isolated object directory under /data/repos/<owner>/<slug>/objects/
Objects are algo-namespaced and sharded: objects/sha256/ab/<62-hex>
Branch heads written atomically to refs/heads/<name> on disk on every push
wire_negotiate does not query musehub_commits for have/want resolution
Force-resign + re-push works without DB surgery
GET /repos/{owner}/{repo}/branches/{name}/repair heals DB from disk
Migration job moves all existing objects to per-repo sharded paths
Pack file GC runs as a background maintenance job
muse push local dev --force after a full force-resign succeeds in one command with no manual DB edits

Activity (1)

gabriel opened this issue 48 days ago

gabriel · 48 days ago

Implemented across six phases. All phases shipped and live on staging.

Phase 1 (per-repo isolation): sha256:af840c37d9e3f
Phase 3 (on-disk refs): sha256:4d14406aff879
Phase 4 (disk-based push negotiation): sha256:b68da9c205ac4
Phase 5 (pack file + GC): sha256:699c1fcf16db4
Phase 6 (storage tier formalisation): sha256:386278accc3dd
Wire-up (repo_root threading through push/fetch): sha256:627a1cfe4efcd sha256:cde72cee233cf sha256:5edef517efe89
Migration script (flat → per-repo, dry-run / migrate / verify / prune): sha256:f8d3be6b8476c

Architecture correction (algo-namespaced paths, _safe_id deletion) captured in the issue body.

Assignee

gabriel human
