gabriel / musehub public
Open #92
filed by gabriel human · 2 days ago

Pre-build fetch mpacks on push to eliminate clone timeout for large repos


Background

wire_fetch_mpack builds the fetch mpack synchronously during a fetch request: it loads all blob bytes from R2, assembles the binary mpack, and uploads it back to R2. For large repos (gabriel/muse has 1329 commits, 1075 blobs, ~22MB), the build takes longer than Cloudflare's 100s origin timeout, so the request fails with HTTP 524. As a result, muse clone cannot complete for any repo of this size.

The fix: pre-build the fetch mpack in the background after every push. On fetch, the handler checks the cache first. A hit is a DB lookup + one presign call — sub-second regardless of repo size. A miss falls back to the existing synchronous build path.

This unblocks cloning gabriel/muse and every future repo that grows past ~200 commits.

Goal

  • muse clone https://staging.musehub.ai/gabriel/muse completes without HTTP 524
  • Fetch response time for a cached tip is under 2s regardless of repo size
  • Cache is populated automatically after every push; no manual steps required
  • Existing fetch path is unchanged for cache misses (no regression)

Phases

Phase 1 — Schema ✅

  • Migration: add musehub_fetch_mpack_cache table
    • cache_id VARCHAR PK (content-addressed: sha256(repo_id + tip_commit_id))
    • repo_id VARCHAR FK → musehub_repos
    • tip_commit_id VARCHAR — the want[0] this mpack was built for
    • mpack_id VARCHAR — R2 key of the pre-built fetch mpack
    • created_at TIMESTAMPTZ
    • expires_at TIMESTAMPTZ — default now + 7 days; GC deletes after expiry
    • unique index on (repo_id, tip_commit_id) — FMC_01
  • SQLAlchemy model MusehubFetchMPackCache in musehub_repo_models.py — FMC_02
  • test_migrations.py: update _ALL_REVISIONS and _HEAD — FMC_03
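
A minimal sketch of how the content-addressed cache_id and the 7-day expiry from Phase 1 could be derived. It uses a plain dataclass rather than the real SQLAlchemy model; FetchMPackCacheRow, compute_cache_id, and new_cache_row are hypothetical names, and the UTF-8 encoding and hex digest are assumptions the issue does not specify.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the schema above: expires_at defaults to now + 7 days.
CACHE_TTL = timedelta(days=7)


def compute_cache_id(repo_id: str, tip_commit_id: str) -> str:
    """sha256 over the plain concatenation repo_id + tip_commit_id.

    Assumptions: UTF-8 encoding and a lowercase hex digest.
    """
    return hashlib.sha256((repo_id + tip_commit_id).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class FetchMPackCacheRow:
    """Hypothetical stand-in mirroring the musehub_fetch_mpack_cache columns."""
    cache_id: str
    repo_id: str
    tip_commit_id: str
    mpack_id: str
    created_at: datetime
    expires_at: datetime


def new_cache_row(
    repo_id: str,
    tip_commit_id: str,
    mpack_id: str,
    now: Optional[datetime] = None,
) -> FetchMPackCacheRow:
    """Build a row with the derived cache_id and the default expiry."""
    created = now if now is not None else datetime.now(timezone.utc)
    return FetchMPackCacheRow(
        cache_id=compute_cache_id(repo_id, tip_commit_id),
        repo_id=repo_id,
        tip_commit_id=tip_commit_id,
        mpack_id=mpack_id,
        created_at=created,
        expires_at=created + CACHE_TTL,
    )
```

Note that plain concatenation hashes ("a", "bc") and ("ab", "c") to the same value. If both IDs are fixed-length this cannot happen in practice, and the unique index on (repo_id, tip_commit_id) remains the authoritative uniqueness check either way.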

Phase 2 — Job handler ✅

  • process_fetch_mpack_prebuild_job(session, job_id) in musehub_wire_fetch.py — FMC_04
    • Payload: { "repo_id": str, "tip_commit_ids": [str] } (all branch tips at push time)
    • For each tip: check MusehubFetchMPackCache; skip if fresh entry exists
    • Call wire_fetch_mpack(session, repo_id, want=[tip], have=[]) — reuses existing build path
    • Write (or upsert) MusehubFetchMPackCache row with returned mpack_id
    • Log: tip count, hits (skipped), misses (built), total elapsed — FMC_05
  • Worker dispatch in worker.py: add fetch.mpack.prebuild case — FMC_06
  • Unit test: mock wire_fetch_mpack, confirm cache rows written for each tip, confirm skip on hit — FMC_07
  • Integration test: push to a test repo, verify MusehubFetchMPackCache row exists after worker runs — FMC_08
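
The per-tip loop in Phase 2 can be sketched as follows. This is not the real process_fetch_mpack_prebuild_job: the cache is an in-memory dict keyed by (repo_id, tip_commit_id), build_mpack is a hypothetical callable standing in for wire_fetch_mpack(session, repo_id, want=[tip], have=[]), and "fresh" is assumed to mean "not yet expired".

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def prebuild_tips(
    cache: Dict[CacheKey, dict],
    repo_id: str,
    tip_commit_ids: List[str],
    build_mpack: Callable[[str, str], str],
    now: Optional[datetime] = None,
) -> dict:
    """Build and cache a fetch mpack for every tip without a fresh entry.

    Returns the counters the issue asks to log: tips, hits, misses, elapsed.
    """
    now = now if now is not None else datetime.now(timezone.utc)
    started = time.monotonic()
    hits = 0
    misses = 0
    for tip in tip_commit_ids:
        entry = cache.get((repo_id, tip))
        # Assumption: an entry is "fresh" while expires_at is in the future.
        if entry is not None and entry["expires_at"] > now:
            hits += 1
            continue
        mpack_id = build_mpack(repo_id, tip)
        # Dict assignment plays the role of the upsert on (repo_id, tip).
        cache[(repo_id, tip)] = {
            "mpack_id": mpack_id,
            "created_at": now,
            "expires_at": now + timedelta(days=7),
        }
        misses += 1
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {
        "tips": len(tip_commit_ids),
        "hits": hits,
        "misses": misses,
        "elapsed_ms": elapsed_ms,
    }
```

Running the same payload twice should build each tip once and then report every tip as a hit, which is the behavior FMC_07 is meant to pin down.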

Phase 3 — Cache lookup in wire_fetch_mpack ✅

  • At the top of wire_fetch_mpack: if len(want) == 1 and have == [], query MusehubFetchMPackCache for (repo_id, want[0]) — FMC_09
  • Hit path: backend.presign_mpack_get(cached.mpack_id, ttl_seconds), return immediately — FMC_10
  • Miss path: build as today; after successful upload, write MusehubFetchMPackCache row so the next fetch is a hit — FMC_11
  • Add logger.warning timing: [wire_fetch_mpack] cache=HIT|MISS tip=... t=...ms — FMC_12
  • Unit test: cache hit returns presigned URL without calling blob-load path — FMC_13
  • Unit test: cache miss builds and writes cache row — FMC_14
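
A rough sketch of the hit and miss paths described in Phase 3. The function fetch_mpack and the callables build_and_upload and presign are hypothetical stand-ins for the existing build path and backend.presign_mpack_get. The in-memory dict replaces the database table, and treating expired rows as misses is an assumption the issue does not state.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def fetch_mpack(
    cache: Dict[CacheKey, dict],
    repo_id: str,
    want: List[str],
    have: List[str],
    build_and_upload: Callable[[str, List[str], List[str]], str],
    presign: Callable[[str, int], str],
    ttl_seconds: int = 3600,
    now: Optional[datetime] = None,
) -> Tuple[str, str]:
    """Return (presigned_url, "HIT" or "MISS")."""
    now = now if now is not None else datetime.now(timezone.utc)
    # Only fresh clones of a single tip are cacheable: len(want) == 1, have == [].
    cacheable = len(want) == 1 and len(have) == 0
    if cacheable:
        entry = cache.get((repo_id, want[0]))
        # Assumption: an expired row is treated as a miss.
        if entry is not None and entry["expires_at"] > now:
            return presign(entry["mpack_id"], ttl_seconds), "HIT"
    # Miss path: the existing synchronous build, unchanged.
    mpack_id = build_and_upload(repo_id, want, have)
    if cacheable:
        # Record the result so the next identical fetch is a hit.
        cache[(repo_id, want[0])] = {
            "mpack_id": mpack_id,
            "expires_at": now + timedelta(days=7),
        }
    return presign(mpack_id, ttl_seconds), "MISS"
```

Requests with a non-empty have list never read or write the cache in this sketch, which matches the "no regression" goal for incremental fetches.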

Phase 4 — Enqueue on push ✅

  • Add "fetch.mpack.prebuild" to job_types_for_push in musehub_intel_providers.py — FMC_15
  • Payload: collect all current branch tip commit IDs for the repo at push time, pass as tip_commit_ids list — FMC_16
  • _FOUNDATION_TYPES in musehub_jobs.py: add "fetch.mpack.prebuild" so it is prioritized alongside mpack.index — FMC_17
  • Integration test: push to a repo, confirm fetch.mpack.prebuild job is enqueued with correct payload — FMC_18
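
As an illustration of the Phase 4 payload, the sketch below collects branch tips into the { "repo_id", "tip_commit_ids" } shape from Phase 2. The function name build_prebuild_job, the branch_tips mapping, and the outer "job_type" envelope are hypothetical. Deduplicating and sorting the tips is also an assumption, made so that two branches pointing at the same commit do not trigger two builds.

```python
from typing import Dict


def build_prebuild_job(repo_id: str, branch_tips: Dict[str, str]) -> dict:
    """Build a fetch.mpack.prebuild job from a {branch_name: tip_commit_id} map."""
    # Several branches may share a tip; each distinct commit needs one mpack.
    tip_commit_ids = sorted(set(branch_tips.values()))
    return {
        "job_type": "fetch.mpack.prebuild",  # hypothetical envelope key
        "payload": {
            "repo_id": repo_id,
            "tip_commit_ids": tip_commit_ids,
        },
    }


job = build_prebuild_job(
    "repo-1",
    {"main": "c2", "dev": "c1", "release": "c2"},
)
```

Sorting keeps the payload deterministic, which makes an assertion like the one in FMC_18 straightforward to write.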

Phase 5 — GC ✅

  • process_gc_job (or the existing gc handler): delete MusehubFetchMPackCache rows where expires_at < now() and call backend.delete(mpack_id) for each — FMC_19
  • Unit test: expired rows deleted, R2 objects removed; fresh rows untouched — FMC_20
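
The GC step could look roughly like this. gc_expired and delete_object are hypothetical names, with delete_object standing in for backend.delete(mpack_id), and the cache is again an in-memory dict rather than the real table.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, str]  # (repo_id, tip_commit_id)


def gc_expired(
    cache: Dict[CacheKey, dict],
    delete_object: Callable[[str], None],
    now: Optional[datetime] = None,
) -> int:
    """Delete rows with expires_at < now and their stored objects.

    Returns the number of rows removed.
    """
    now = now if now is not None else datetime.now(timezone.utc)
    # Collect first so the dict is not mutated while iterating over it.
    expired = [(key, entry) for key, entry in cache.items() if entry["expires_at"] < now]
    for key, entry in expired:
        # Remove the object before the row: if the object delete raises,
        # the row survives and the next GC run retries it.
        delete_object(entry["mpack_id"])
        del cache[key]
    return len(expired)
```

Deleting the object first is a design choice in this sketch, not something the issue prescribes. The opposite order would risk orphaned R2 objects with no row left to find them.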

Acceptance Criteria

  • rm -rf /tmp/muse-clone-test && muse clone https://staging.musehub.ai/gabriel/muse /tmp/muse-clone-test completes in under 60s with no HTTP 5xx
  • After a push to any branch, MusehubFetchMPackCache has a row for every branch tip within worker poll interval (~10s)
  • Fetching when cache is warm: wire_fetch_mpack returns in under 2s ([wire_fetch_mpack] cache=HIT in logs)
  • No regression: existing wire_fetch_mpack behavior is identical when have != [] or cache is cold

Out of Scope

  • Caching incremental fetches (have != []) — only fresh clones (have=[]) are cached in this issue
  • Streaming or chunked mpack download — mpack is downloaded as a single presigned GET
  • Pre-building on branch creation or force-push — only normal push triggers the job