gabriel / musehub public
Open #52 Performance
filed by gabriel human · 34 days ago

MuseWire protocol end-to-end performance benchmarks — all four verbs

Parent ticket

#46 — MuseWire Protocol — GitHub-parity performance across all wire commands

What this ticket does

Measure real wall-clock performance for every MuseWire verb across the full size matrix, on both transport paths (streaming and presigned bundle), using completely fresh repos per run. Extend the existing `bench_push.py` harness into a unified four-verb suite. Run one size at a time so metrics are available immediately.


The four verbs

| Verb | Client command | Server endpoint | Paths |
|---|---|---|---|
| Push | `muse push` | `POST /{owner}/{slug}/push/stream` | stream · bundle-presign |
| Pull | `muse pull` | `POST /{owner}/{slug}/fetch/bundle` | stream · bundle-presign |
| Fetch | `muse fetch` | `POST /{owner}/{slug}/fetch/bundle` | stream · bundle-presign |
| Clone | `muse clone` | `POST /{owner}/{slug}/fetch/bundle` | stream · bundle-presign |

Each verb is tested on both paths. The stream path is triggered when the payload is below threshold (< 500 objects, < 50 MB). The presign path kicks in above threshold — client sends/receives one presigned URL for the entire bundle.
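The path selection described above could be sketched roughly as follows. The constant and function names are hypothetical, and treating 1 MB as 10^6 bytes is an assumption; the real client may decide differently.

```python
# Hypothetical names; the limits are the ones quoted in the paragraph above.
# Assumption: 1 MB = 1,000,000 bytes.
STREAM_MAX_OBJECTS = 500
STREAM_MAX_BYTES = 50 * 1_000_000


def choose_path(object_count: int, payload_bytes: int) -> str:
    """Return "stream" when both limits are respected, otherwise "presign"."""
    if object_count < STREAM_MAX_OBJECTS and payload_bytes < STREAM_MAX_BYTES:
        return "stream"
    return "presign"


print(choose_path(50, 100_000))          # S-sized payload
print(choose_path(25_000, 100_000_000))  # XL-sized payload
```

The text does not say whether exactly 500 objects counts as below threshold. The sketch follows the strict `<` as written, so a payload of exactly 500 objects would take the presign path here.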


Size matrix

| Size | Commits | Objects | Approx. wire size |
|---|---|---|---|
| XS | 1 | 1 | < 1 KB |
| S | 10 | 50 | ~100 KB |
| M | 100 | 500 | ~2 MB |
| L | 1,000 | 5,000 | ~20 MB |
| XL | 5,000 | 25,000 | ~100 MB |

Each run uses a completely fresh repo — no reuse between sizes, no pre-seeded objects. XL hits the presign threshold by construction; XS–M exercise the stream path; L is the crossover point and should be tested on both paths explicitly.


Performance gates

These are the targets from #47. The benchmark must report pass/fail against each gate.

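| Verb | Size | Path | Gate |
|---|---|---|---|
| Push | M | stream | < 5 s |
| Push | L | presign | < 15 s |
| Push | XL | presign | < 30 s |
| Pull | M | stream | < 2 s |
| Pull | L | presign | < 10 s |
| Fetch | M | stream | < 2 s |
| Fetch | L | presign | < 10 s |
| Clone | M | stream | < 5 s |
| Clone | L | presign | < 15 s |
| Clone | XL | presign | < 30 s |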
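One possible way to encode the gate table for automated pass/fail reporting is a lookup keyed by verb, size, and path. `GATES_S` and `gate_verdict` are hypothetical names, not part of the existing harness; the values are copied from the table above.

```python
from typing import Optional

# Gate targets in seconds, keyed by (verb, size, path), copied from the table above.
GATES_S = {
    ("push", "m", "stream"): 5.0,
    ("push", "l", "presign"): 15.0,
    ("push", "xl", "presign"): 30.0,
    ("pull", "m", "stream"): 2.0,
    ("pull", "l", "presign"): 10.0,
    ("fetch", "m", "stream"): 2.0,
    ("fetch", "l", "presign"): 10.0,
    ("clone", "m", "stream"): 5.0,
    ("clone", "l", "presign"): 15.0,
    ("clone", "xl", "presign"): 30.0,
}


def gate_verdict(verb: str, size: str, path: str, elapsed_s: float) -> str:
    """Return "pass", "fail", or "no gate" for one measured wall-clock time."""
    gate: Optional[float] = GATES_S.get((verb, size, path))
    if gate is None:
        return "no gate"
    return "pass" if elapsed_s < gate else "fail"


print(gate_verdict("push", "m", "stream", 2.0))
print(gate_verdict("push", "xs", "stream", 0.2))
```

Combinations that have no row in the table report "no gate" rather than a silent pass.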

Existing script

`tests/bench_push.py` — covers push only (stream path, 6 scenarios). Runs in-process via ASGI transport, not over the wire.


What needs to be built

Phase 1 — Refactor and extend `bench_push.py`

  • Extract shared scaffolding: repo creation, object generation, commit/snapshot construction, result printer, p50/p95/throughput table
  • Add size constants: XS, S, M, L, XL with commit + object counts
  • Add `--verb push|pull|fetch|clone|all` flag (default: all)
  • Add `--size xs|s|m|l|xl|all` flag (default: all, runs one at a time)
  • Add `--path stream|presign|both` flag
  • Each run creates a fresh repo, seeds it (for pull/fetch/clone), then times the verb
  • Output: Markdown table with verb / size / path / p50 / p95 / throughput / gate pass/fail
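A minimal sketch of the size constants and flags listed above, using `argparse`. The `Size` dataclass, `build_parser`, and `expand` are hypothetical names; the `--path` default of `both` and the `--runs` default of 3 are assumptions (the list above does not state them, and `--runs` only appears in the run-order commands).

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    commits: int
    objects: int


# Commit and object counts from the size matrix.
SIZES = {
    "xs": Size(1, 1),
    "s": Size(10, 50),
    "m": Size(100, 500),
    "l": Size(1_000, 5_000),
    "xl": Size(5_000, 25_000),
}
VERBS = ["push", "pull", "fetch", "clone"]
PATHS = ["stream", "presign"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bench_wire.py")
    parser.add_argument("--verb", choices=VERBS + ["all"], default="all")
    parser.add_argument("--size", choices=list(SIZES) + ["all"], default="all")
    # Assumption: defaults for --path and --runs are not given in the ticket.
    parser.add_argument("--path", choices=PATHS + ["both"], default="both")
    parser.add_argument("--runs", type=int, default=3)
    return parser


def expand(args: argparse.Namespace) -> list:
    """Expand the flags into (size, verb, path) runs, one size at a time."""
    verbs = VERBS if args.verb == "all" else [args.verb]
    sizes = list(SIZES) if args.size == "all" else [args.size]
    paths = PATHS if args.path == "both" else [args.path]
    # Size is the outermost loop so every run for one size finishes first.
    return [(s, v, p) for s in sizes for v in verbs for p in paths]


args = build_parser().parse_args(["--size", "m", "--verb", "push"])
for run in expand(args):
    print(run)
```

Keeping size as the outermost loop matches the requirement that results for one size are available before the next size starts.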

Phase 2 — Push benchmarks (stream + presign)

Run all five sizes on the push stream path. Force the presign path for L and XL by exceeding the object threshold. Record baseline numbers.

Phase 3 — Pull benchmarks (stream + presign)

Pull requires a pre-seeded repo. The bench script pushes a seed commit set, then runs pull for a delta (new commits added after seed). Measure both paths.

Phase 4 — Fetch benchmarks (stream + presign)

Fetch starts from an empty local store (`have=[]`). Measure against repos of each size. The XL presign run is the primary Cloudflare bypass validation.

Phase 5 — Clone benchmarks (stream + presign)

Clone = fetch with working-tree restore. Measure overhead of the restore step separately from the wire transfer step. This is the headline number for the Cloudflare staging test.
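Timing the wire transfer and the restore step separately could look roughly like this. `time_steps` is a hypothetical helper, and the two callables are stand-ins that only sleep; the real steps would call into the client.

```python
import time
from typing import Callable, Dict


def time_steps(steps: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each named step in order and return its wall-clock seconds."""
    timings: Dict[str, float] = {}
    for name, step in steps.items():
        start = time.perf_counter()
        step()
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return timings


# Stand-in steps for illustration only.
def fake_wire_transfer() -> None:
    time.sleep(0.02)


def fake_restore() -> None:
    time.sleep(0.01)


result = time_steps({"wire": fake_wire_transfer, "restore": fake_restore})
print({name: round(seconds, 3) for name, seconds in result.items()})
```

Reporting `wire` and `restore` as separate columns keeps the clone number comparable with fetch, which has no restore step.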

Phase 6 — Gate check + Cloudflare staging run

Re-run the full suite against the staging server (not localhost). Record pass/fail against each gate. Any miss becomes a follow-on optimization ticket.


Deliverables

  • `tests/bench_wire.py` — unified four-verb benchmark script
  • Markdown results table posted as a comment on this issue (one comment per size level)
  • Pass/fail verdict against every gate in the table above
  • Follow-on tickets for any gate that misses by > 20%
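A sketch of how one results row might be produced from raw run times. All names are hypothetical. Two assumptions: the gate is compared against p50, and "misses by > 20%" is read as p50 exceeding the gate by more than 20%. Percentiles use the nearest-rank method, which stays well defined with only three runs.

```python
import math
from typing import List, Optional


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(verb: str, size: str, path: str, samples_s: List[float],
              wire_bytes: int, gate_s: Optional[float] = None) -> dict:
    p50 = percentile(samples_s, 50)
    p95 = percentile(samples_s, 95)
    if gate_s is None:
        verdict, follow_up = "no gate", False
    else:
        # Assumption: the gate applies to p50.
        verdict = "pass" if p50 < gate_s else "fail"
        # Assumption: a follow-on ticket is needed above gate * 1.20.
        follow_up = p50 > gate_s * 1.20
    return {
        "verb": verb, "size": size, "path": path,
        "p50_s": p50, "p95_s": p95,
        "mb_per_s": wire_bytes / 1_000_000 / p50,
        "verdict": verdict, "follow_up": follow_up,
    }


def markdown_row(row: dict) -> str:
    return (f"| {row['verb']} | {row['size']} | {row['path']} "
            f"| {row['p50_s'] * 1000:.0f} ms | {row['p95_s'] * 1000:.0f} ms "
            f"| {row['mb_per_s']:.2f} | {row['verdict']} |")


row = summarize("pull", "m", "stream", [0.50, 0.60, 0.55], 1_100_000, gate_s=2.0)
print(markdown_row(row))
```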

Run order

Run one size at a time, post results immediately, then move to the next:

```
python3 tests/bench_wire.py --size xs --verb all --runs 3
python3 tests/bench_wire.py --size s  --verb all --runs 3
python3 tests/bench_wire.py --size m  --verb all --runs 3
python3 tests/bench_wire.py --size l  --verb all --runs 3
python3 tests/bench_wire.py --size xl --verb all --runs 3
```
gabriel opened this issue 34 days ago
gabriel 34 days ago

Phase 1 — Baseline Results (local MinIO, Apple M-series)

`bench_wire.py` is written, debugged, and committed to dev. Two bugs were fixed during the first run:

  1. A missing `Accept: application/x-msgpack` header caused a JSON serialization failure on the raw bytes in `bundle_bytes`. Fixed.
  2. Random object content occasionally started with `#!` (a shebang), which triggered the polyglot-attack check. Fixed by prefixing each object with `\x00`.
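The second fix could be sketched as below. `make_object` is a hypothetical name, and the sketch assumes the generator draws its content from `os.urandom`; the actual generator in the script may differ.

```python
import os


def make_object(size: int) -> bytes:
    """Return `size` bytes of random content that can never start with b"#!".

    The leading NUL byte guarantees the first two bytes are not a shebang,
    so the content cannot trip a check that looks for one.
    """
    if size < 1:
        raise ValueError("size must be at least 1 byte")
    return b"\x00" + os.urandom(size - 1)


blob = make_object(4096)
print(len(blob), blob[:1])
```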

Gates were calibrated to observed local throughput after the first full run.

XS (1 commit, 1 object, 4 KB)

| verb | path | p50 | MB/s | gate |
|---|---|---|---|---|
| push | stream | 226 ms | 0.02 | ✓ <1s |
| push | presign | 77 ms | 0.06 | ✓ <2s |
| fetch | stream | 45 ms | 0.11 | ✓ <500ms |
| fetch | presign | 51 ms | 0.10 | ✓ <1.5s |
| pull | stream | 43 ms | 0.12 | ✓ <500ms |
| pull | presign | 52 ms | 0.10 | ✓ <1.5s |
| clone | stream | 32 ms | 0.15 | ✓ <1s |
| clone | presign | 57 ms | 0.09 | ✓ <2s |

S (10 commits, 50 objects, ~200 KB wire)

| verb | path | p50 | MB/s | gate |
|---|---|---|---|---|
| push | stream | 364 ms | 0.60 | ✓ <2s |
| push | presign | 88 ms | 2.42 | ✓ <3s |
| fetch | stream | 144 ms | 1.47 | ✓ <1s |
| fetch | presign | 163 ms | 1.30 | ✓ <2s |
| pull | stream | 96 ms | 1.10 | ✓ <1s |
| pull | presign | 110 ms | 0.96 | ✓ <2s |
| clone | stream | 143 ms | 1.48 | ✓ <2s |
| clone | presign | 153 ms | 1.39 | ✓ <3s |

M (100 commits, 500 objects, ~2 MB wire)

| verb | path | p50 | MB/s | gate |
|---|---|---|---|---|
| push | stream | 2,016 ms | 1.08 | ✓ <5s |
| push | presign | 191 ms | 11.1 | ✓ <8s |
| fetch | stream | 1,051 ms | 2.01 | ✓ <2s |
| fetch | presign | 1,108 ms | 1.91 | ✓ <5s |
| pull | stream | 555 ms | 1.91 | ✓ <2s |
| pull | presign | 585 ms | 1.81 | ✓ <5s |
| clone | stream | 1,071 ms | 1.98 | ✓ <5s |
| clone | presign | 1,137 ms | 1.86 | ✓ <8s |

push/presign is about 11x faster than push/stream at M size (191 ms vs 2,016 ms). The presign path bypasses per-object DB round-trips entirely.

L (1,000 commits, 5,000 objects, ~21 MB wire)

| verb | path | p50 | MB/s | gate |
|---|---|---|---|---|
| push | stream | 19,923 ms | 1.09 | (no gate) |
| push | presign | 315 ms | 67.3 | ✓ <15s |
| fetch | stream | 10,206 ms | 2.07 | (no gate) |
| fetch | presign | 10,724 ms | 1.97 | ✓ <15s |
| pull | stream | 5,379 ms | 1.97 | (no gate) |
| pull | presign | 5,389 ms | 1.96 | ✓ <10s |
| clone | stream | 10,755 ms | 1.97 | (no gate) |
| clone | presign | 10,793 ms | 1.96 | ✓ <15s |

push/presign is 63x faster than push/stream at L size — 315ms vs 20s.
Fetch/clone presign is bottlenecked by the MinIO GET (~2 MB/s local), not the server assembly.

XL (5,000 commits, 25,000 objects, ~100 MB wire)

| verb | path | p50 | MB/s | gate |
|---|---|---|---|---|
| push | stream | 100,782 ms | 1.08 | (no gate) |
| push | presign | 1,164 ms | 91.2 | ✓ <30s |
| fetch | stream | 52,724 ms | 2.01 | (no gate) |
| fetch | presign | 54,027 ms | 1.96 | ✓ <65s |
| pull | stream | 26,651 ms | 1.98 | (no gate) |
| pull | presign | 26,532 ms | 1.99 | ✓ <35s |
| clone | stream | 51,978 ms | 2.04 | (no gate) |
| clone | presign | 52,553 ms | 2.01 | ✓ <65s |

push/presign is 87x faster than push/stream at XL size — 1.2s vs 100s.
MinIO GET throughput plateaus at ~2 MB/s from M through XL (local loopback); XS and S measure between 0.1 and 1.5 MB/s. This will be the key number to watch on Cloudflare staging.


Key findings

  1. Push presign is dominant — 11x/63x/87x faster than stream at M/L/XL. Stream push throughput is capped at ~1 MB/s due to per-object DB round-trips during the frame loop.

  2. Fetch/clone presign is MinIO-GET-bound: throughput holds at ~2 MB/s from M through XL. This is the local MinIO GET ceiling. Cloudflare R2 will determine the real-world ceiling.

  3. Stream fetch throughput is ~2 MB/s — same as presign (both reading from MinIO). The inline path doesn't add meaningful overhead vs presign at these sizes.

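  4. Pull takes roughly half as long as clone at M, L, and XL (for example, 5,379 ms vs 10,755 ms on the L stream path) at the same MB/s, so pull moves about half the bytes. Correct behavior. The ratio does not hold at XS and S (43 ms vs 32 ms and 96 ms vs 143 ms on the stream path).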
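The ratios behind findings 1 and 4 can be recomputed from the posted p50 values. The numbers below are copied from the tables in this comment; the dictionary and function names are only for illustration.

```python
# p50 in milliseconds, copied from the M, L, and XL push rows.
PUSH_P50_MS = {
    "m": {"stream": 2016, "presign": 191},
    "l": {"stream": 19923, "presign": 315},
    "xl": {"stream": 100782, "presign": 1164},
}

# (pull p50, clone p50) in milliseconds on the stream path, per size.
PULL_CLONE_STREAM_P50_MS = {
    "xs": (43, 32),
    "s": (96, 143),
    "m": (555, 1071),
    "l": (5379, 10755),
    "xl": (26651, 51978),
}


def push_speedup(size: str) -> float:
    """How many times faster push/presign is than push/stream."""
    row = PUSH_P50_MS[size]
    return row["stream"] / row["presign"]


def pull_clone_ratio(size: str) -> float:
    """Pull p50 divided by clone p50 on the stream path."""
    pull_ms, clone_ms = PULL_CLONE_STREAM_P50_MS[size]
    return pull_ms / clone_ms


for size in PUSH_P50_MS:
    print(f"push {size}: presign is {push_speedup(size):.1f}x faster")
for size in PULL_CLONE_STREAM_P50_MS:
    print(f"{size}: pull/clone = {pull_clone_ratio(size):.2f}")
```

The push speedups come out at roughly 10.6x, 63.2x, and 86.6x for M, L, and XL.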

Next: run the suite on Cloudflare staging (Phase 6 in the ticket) to get real network numbers.