Open
#92
Pre-build fetch mpacks on push to eliminate clone timeout for large repos
Background
wire_fetch_mpack builds the fetch mpack synchronously during a fetch request: it loads all blob bytes from R2, assembles the binary mpack, and uploads it back to R2. For large repos (gabriel/muse has 1329 commits, 1075 blobs, ~22MB), this exceeds Cloudflare's 100s origin timeout, producing HTTP 524. muse clone is impossible for any repo of this size.
The fix: pre-build the fetch mpack in the background after every push. On fetch, the handler checks the cache first. A hit is a DB lookup + one presign call — sub-second regardless of repo size. A miss falls back to the existing synchronous build path.
This unblocks cloning gabriel/muse and every future repo that grows past ~200 commits.
Goal
- muse clone https://staging.musehub.ai/gabriel/muse completes without HTTP 524
- Fetch response time for a cached tip is under 2s regardless of repo size
- Cache is populated automatically after every push; no manual steps required
- Existing fetch path is unchanged for cache misses (no regression)
Phases
Phase 1 — Schema ✅
- Migration: add musehub_fetch_mpack_cache table
  - cache_id VARCHAR PK (content-addressed: sha256(repo_id + tip_commit_id))
  - repo_id VARCHAR FK → musehub_repos
  - tip_commit_id VARCHAR: the want[0] this mpack was built for
  - mpack_id VARCHAR: R2 key of the pre-built fetch mpack
  - created_at TIMESTAMPTZ
  - expires_at TIMESTAMPTZ: default now + 7 days; GC deletes after expiry
  - unique index on (repo_id, tip_commit_id) (FMC_01)
- SQLAlchemy model MusehubFetchMPackCache in musehub_repo_models.py (FMC_02)
- test_migrations.py: update _ALL_REVISIONS and _HEAD (FMC_03)
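The content-addressed key and the expiry default from the schema can be sketched with the standard library alone. The helper names `fetch_mpack_cache_id` and `default_expires_at` are hypothetical, and the encoding details (plain string concatenation, UTF-8, hex digest) are assumptions; the issue only specifies sha256(repo_id + tip_commit_id) and a 7-day default.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_TTL = timedelta(days=7)  # mirrors the "now + 7 days" default in the schema


def fetch_mpack_cache_id(repo_id: str, tip_commit_id: str) -> str:
    """Return sha256(repo_id + tip_commit_id) as a 64-character hex string.

    Assumption: plain concatenation of UTF-8 strings and a hex-encoded digest.
    """
    digest = hashlib.sha256((repo_id + tip_commit_id).encode("utf-8"))
    return digest.hexdigest()


def default_expires_at(now: Optional[datetime] = None) -> datetime:
    """Return the expiry timestamp for a row created at `now` (UTC by default)."""
    created = now if now is not None else datetime.now(timezone.utc)
    return created + CACHE_TTL
```

Because the key is derived only from the pair, it carries the same information as the unique index on (repo_id, tip_commit_id); with unseparated concatenation, that index is what guards against two different pairs producing the same concatenated string.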
Phase 2 — Job handler ✅
- process_fetch_mpack_prebuild_job(session, job_id) in musehub_wire_fetch.py (FMC_04)
  - Payload: { "repo_id": str, "tip_commit_ids": [str] } (all branch tips at push time)
  - For each tip: check MusehubFetchMPackCache; skip if a fresh entry exists
  - Call wire_fetch_mpack(session, repo_id, want=[tip], have=[]), which reuses the existing build path
  - Write (or upsert) a MusehubFetchMPackCache row with the returned mpack_id
  - Log: tip count, hits (skipped), misses (built), total elapsed (FMC_05)
- Worker dispatch in worker.py: add fetch.mpack.prebuild case (FMC_06)
- Unit test: mock wire_fetch_mpack, confirm cache rows written for each tip, confirm skip on hit (FMC_07)
- Integration test: push to a test repo, verify MusehubFetchMPackCache row exists after worker runs (FMC_08)
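The per-tip loop of the job handler might look roughly like the sketch below. It is not the real handler: an in-memory dict stands in for the MusehubFetchMPackCache table, the injected `build_mpack` callable stands in for wire_fetch_mpack(session, repo_id, want=[tip], have=[]), and the function name `prebuild_tips` and the shape of the returned stats are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def prebuild_tips(
    repo_id: str,
    tip_commit_ids: List[str],
    cache: Dict[CacheKey, Dict[str, object]],
    build_mpack: Callable[[str, str], str],
    now: Optional[datetime] = None,
    ttl: timedelta = timedelta(days=7),
) -> Dict[str, float]:
    """Build and cache an mpack for every tip that has no fresh cache entry."""
    current = now if now is not None else datetime.now(timezone.utc)
    started = time.monotonic()
    hits = misses = 0
    for tip in tip_commit_ids:
        key = (repo_id, tip)
        entry = cache.get(key)
        if entry is not None and entry["expires_at"] > current:
            hits += 1  # fresh entry exists: skip the build
            continue
        mpack_id = build_mpack(repo_id, tip)  # stand-in for the existing build path
        cache[key] = {"mpack_id": mpack_id, "expires_at": current + ttl}  # upsert
        misses += 1
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {
        "tips": len(tip_commit_ids),
        "hits": hits,
        "misses": misses,
        "elapsed_ms": elapsed_ms,
    }
```

The returned counters correspond to the fields the log line is meant to carry (tip count, hits, misses, total elapsed). Treating an expired row as a miss is an assumption about what "fresh" means here.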
Phase 3 — Cache lookup in wire_fetch_mpack ✅
- At the top of wire_fetch_mpack: if len(want) == 1 and have == [], query MusehubFetchMPackCache for (repo_id, want[0]) (FMC_09)
- Hit path: backend.presign_mpack_get(cached.mpack_id, ttl_seconds), return immediately (FMC_10)
- Miss path: build as today; after successful upload, write MusehubFetchMPackCache row so the next fetch is a hit (FMC_11)
- Add logger.warning timing: [wire_fetch_mpack] cache=HIT|MISS tip=... t=...ms (FMC_12)
- Unit test: cache hit returns presigned URL without calling blob-load path (FMC_13)
- Unit test: cache miss builds and writes cache row (FMC_14)
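The hit and miss paths above can be illustrated with a small sketch. It assumes a dict in place of the cache table and injected `presign` and `build` callables in place of backend.presign_mpack_get and the existing synchronous build; `is_cacheable` and `fetch_mpack_url` are hypothetical names, and expiry checks are left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def is_cacheable(want: List[str], have: List[str]) -> bool:
    """Only a fresh clone of a single tip is served from the cache."""
    return len(want) == 1 and not have


def fetch_mpack_url(
    repo_id: str,
    want: List[str],
    have: List[str],
    cache: Dict[CacheKey, str],
    presign: Callable[[str], str],
    build: Callable[[str, List[str], List[str]], str],
) -> Tuple[str, str]:
    """Return (presigned_url, "HIT" or "MISS") for a fetch request."""
    cacheable = is_cacheable(want, have)
    if cacheable:
        cached_mpack_id = cache.get((repo_id, want[0]))
        if cached_mpack_id is not None:
            # Hit: no blob loading, only a lookup and a presign call.
            return presign(cached_mpack_id), "HIT"
    # Miss (or not cacheable): fall back to the synchronous build.
    mpack_id = build(repo_id, want, have)
    if cacheable:
        cache[(repo_id, want[0])] = mpack_id  # the next identical fetch is a hit
    return presign(mpack_id), "MISS"
```

Requests with a non-empty have list bypass the cache in both directions in this sketch: they neither read a cached row nor write one, which matches the scope stated for this issue.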
Phase 4 — Enqueue on push ✅
- Add "fetch.mpack.prebuild" to job_types_for_push in musehub_intel_providers.py (FMC_15)
- Payload: collect all current branch tip commit IDs for the repo at push time, pass as tip_commit_ids list (FMC_16)
- _FOUNDATION_TYPES in musehub_jobs.py: add "fetch.mpack.prebuild" so it is prioritized alongside mpack.index (FMC_17)
- Integration test: push to a repo, confirm fetch.mpack.prebuild job is enqueued with correct payload (FMC_18)
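As a rough illustration of the payload step, the sketch below builds the job payload from a mapping of branch names to tip commit IDs. The function name `build_prebuild_payload` and the input mapping are hypothetical, and de-duplicating tips shared by several branches is an assumption rather than something the issue requires.

```python
from typing import Dict, List


def build_prebuild_payload(repo_id: str, branch_tips: Dict[str, str]) -> Dict[str, object]:
    """Build the fetch.mpack.prebuild payload from {branch_name: tip_commit_id}."""
    # dict.fromkeys drops duplicate tips while keeping first-seen order,
    # so two branches pointing at the same commit produce a single entry.
    unique_tips: List[str] = list(dict.fromkeys(branch_tips.values()))
    return {"repo_id": repo_id, "tip_commit_ids": unique_tips}
```

For example, with main and release both at c3 and dev at c5, the payload would list c3 and c5 once each, so the worker does not check the same tip twice.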
Phase 5 — GC ✅
- process_gc_job (or the existing gc handler): delete MusehubFetchMPackCache rows where expires_at < now() and call backend.delete(mpack_id) for each (FMC_19)
- Unit test: expired rows deleted, R2 objects removed; fresh rows untouched (FMC_20)
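The GC sweep can be sketched as follows, again with a dict standing in for the cache table and an injected `delete_object` callable standing in for backend.delete. The name `gc_expired_entries` is hypothetical, and deleting the stored object before the row is a design assumption, not something the issue prescribes.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def gc_expired_entries(
    cache: Dict[CacheKey, Dict[str, object]],
    delete_object: Callable[[str], object],
    now: Optional[datetime] = None,
) -> List[str]:
    """Delete rows whose expires_at is in the past and return their mpack IDs."""
    current = now if now is not None else datetime.now(timezone.utc)
    # Collect keys first so the dict is not mutated while iterating over it.
    expired = [key for key, row in cache.items() if row["expires_at"] < current]
    removed: List[str] = []
    for key in expired:
        mpack_id = str(cache[key]["mpack_id"])
        delete_object(mpack_id)  # remove the stored object first
        del cache[key]  # then drop the row that pointed at it
        removed.append(mpack_id)
    return removed
```

Removing the object before the row means a failure partway through leaves a row that a later sweep can retry, rather than an orphaned object with nothing referencing it.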
Acceptance Criteria
- rm -rf /tmp/muse-clone-test && muse clone https://staging.musehub.ai/gabriel/muse /tmp/muse-clone-test completes in under 60s with no HTTP 5xx
- After a push to any branch, MusehubFetchMPackCache has a row for every branch tip within worker poll interval (~10s)
- Fetching when cache is warm: wire_fetch_mpack returns in under 2s ([wire_fetch_mpack] cache=HIT in logs)
- No regression: existing wire_fetch_mpack behavior is identical when have != [] or cache is cold
Out of Scope
- Caching incremental fetches (have != []): only fresh clones (have=[]) are cached in this issue
- Streaming or chunked mpack download: mpack is downloaded as a single presigned GET
- Pre-building on branch creation or force-push: only normal push triggers the job