perf: eliminate MinIO ghost-check from push/stream hot path
Phase 3b was doing asyncio.gather(backend.exists()) against every object referenced by new snapshots — 8791 MinIO HEAD requests at ~22ms each with Semaphore(50), taking ~197s per push.
Root cause: fighting the content-addressed guarantee instead of trusting it. Every object_id is sha256:<hex> — the ID IS the content. push/confirm already verified storage existence before writing musehub_objects. Doing it again in push/stream is redundant and architecturally incorrect.
Fix:
- Extract _check_missing_objects(session, needs_check): one IN query, no MinIO. Returns object_ids absent from musehub_objects entirely.
- Phase 3b: reject only objects not registered in the DB. Objects in musehub_objects are trusted; storage availability is a read-time concern and background-job responsibility, not a push gate.
- Phase 7 snapshots: remove per-batch session.commit(). A single atomic commit in phase 10 covers snapshots + commits + branch update.
- Phase 8 commits: same, remove per-batch session.commit().
- Migration 0052: index on musehub_object_refs(object_id) for bulk lookups.
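A minimal sketch of the DB-only membership check described above, not the actual implementation. It assumes a musehub_objects table with an object_id column and uses the stdlib sqlite3 module so it runs standalone; the real helper takes an async session, and the chunk_size parameter here is a hypothetical addition that only exists to stay under SQLite's bind-parameter limit.

```python
import sqlite3
from typing import Iterable


def check_missing_objects(
    conn: sqlite3.Connection,
    needs_check: Iterable[str],
    chunk_size: int = 500,  # hypothetical; keeps each IN list under SQLite's bind limit
) -> set[str]:
    """Return the object_ids from needs_check that have no row in musehub_objects."""
    wanted = list(dict.fromkeys(needs_check))  # de-duplicate, keep order
    found: set[str] = set()
    for start in range(0, len(wanted), chunk_size):
        chunk = wanted[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT object_id FROM musehub_objects WHERE object_id IN ({placeholders})",
            chunk,
        )
        found.update(row[0] for row in rows)
    # Anything requested but not found is unregistered and should be rejected.
    return set(wanted) - found


# Demo against an in-memory database (table layout is an assumption).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE musehub_objects (object_id TEXT PRIMARY KEY)")
demo_conn.executemany(
    "INSERT INTO musehub_objects (object_id) VALUES (?)",
    [("sha256:aa",), ("sha256:bb",)],
)
demo_conn.commit()

missing = check_missing_objects(demo_conn, ["sha256:aa", "sha256:cc", "sha256:cc"])
print(missing)
```

No storage backend is contacted at any point: the cost is a handful of indexed lookups instead of one HEAD request per object.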
Per-phase timing before → after:
- phase 3b (ghost check): 197s → ~1.8s (one DB IN query)
- phase 7 (snapshots): 6.5s → ~0.7s (no per-batch commit)
- phase 8 (commits): 14.7s → ~5s (no per-batch commit)
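The phase 7 and phase 8 savings come from writing every batch inside one transaction and committing once at the end. A rough sketch of that pattern follows, again using stdlib sqlite3 for self-containment; the snapshots table, the insert_in_batches helper, and the batch size are hypothetical stand-ins, not the project's code.

```python
import sqlite3


def insert_in_batches(conn: sqlite3.Connection, rows: list, batch_size: int = 1000) -> None:
    """Write rows batch by batch WITHOUT committing; the caller commits once."""
    for start in range(0, len(rows), batch_size):
        conn.executemany(
            "INSERT INTO snapshots (snapshot_id) VALUES (?)",
            rows[start:start + batch_size],
        )
        # Deliberately no conn.commit() here: a per-batch commit forces a
        # durable flush each time and breaks atomicity across batches.


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE snapshots (snapshot_id TEXT PRIMARY KEY)")
db.commit()

snapshot_rows = [(f"snap-{i}",) for i in range(2500)]

# A failure before the final commit leaves nothing behind.
insert_in_batches(db, snapshot_rows)
db.rollback()
count_after_rollback = db.execute("SELECT COUNT(*) FROM snapshots").fetchone()[0]

# The success path: all batches become visible together in one commit.
insert_in_batches(db, snapshot_rows)
db.commit()
count_after_commit = db.execute("SELECT COUNT(*) FROM snapshots").fetchone()[0]

print(count_after_rollback, count_after_commit)
```

Besides being faster, the single commit means a push either lands completely or not at all, which matches the "single atomic commit in phase 10" described in the fix.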
Tests: 6 new property tests in test_push_stream_ghost_skip.py (P1–P6). Updated C7 and i2 to reflect correct architecture.