Bundle push security: size gates, content validation, and abuse defense
Which repo
MuseHub, not Muse. Security enforcement must be server-side — client-side checks are UX, not security. Any caller hitting the API directly bypasses the CLI entirely. All four phases live in musehub except minor client hints in Phase 4.
Attack surface summary
The bundle push path has four exploitable properties today:
- No size cap — presign accepts any size_bytes. A 50 GB bundle uploads successfully.
- Content is opaque until background job runs — illegal content lands in MinIO before any scan.
- Client-supplied counts are trusted — commits_count=1 with a 10M-object bundle passes the sync path and blows up the indexer.
- No decompression guard — a 47 MB zstd bundle that expands to 4 GB crashes the background worker.
Phase 1 — Size gates (one sitting, zero risk)
These are the cheapest guards with the highest leverage. All changes in musehub/.
1a. Presign endpoint: reject oversized bundles before issuing a URL
musehub/musehub/api/routes/wire.py — push_bundle_presign:
    _MAX_BUNDLE_BYTES = 512 * 1024 * 1024  # 512 MB — tune per tier
    if size_bytes > _MAX_BUNDLE_BYTES:
        raise HTTPException(
            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f"bundle size {size_bytes:,} bytes exceeds limit {_MAX_BUNDLE_BYTES:,}",
        )
No DB access needed. Fires before the presigned URL is issued, so the client never gets a URL to abuse.
1b. Unpack-bundle: verify wire size after MinIO GET
musehub/musehub/services/musehub_wire.py — wire_push_unpack_bundle, after the MinIO GET:
    if len(wire_bytes) > _MAX_BUNDLE_BYTES:
        raise ValueError(
            f"bundle {bundle_key[:20]} exceeds size limit: {len(wire_bytes):,} bytes"
        )
Defends against a client that bypasses the presign check and PUTs directly to MinIO.
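The check in 1b runs only after the whole blob is already in memory, so a very large direct PUT could still exhaust the worker before `len(wire_bytes)` is evaluated. A minimal sketch of a capped read follows, assuming the storage client can hand back a file-like object with `read(n)`. `read_capped` and `BundleTooLargeError` are hypothetical names, not existing MuseHub code.

```python
import io
from typing import BinaryIO


class BundleTooLargeError(ValueError):
    """Hypothetical error type; the plan above raises a plain ValueError."""


def read_capped(stream: BinaryIO, max_bytes: int, chunk_size: int = 1024 * 1024) -> bytes:
    """Read a stream to the end, aborting as soon as more than max_bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop here: at most one extra chunk was read past the cap.
            raise BundleTooLargeError(
                f"bundle exceeds size limit: more than {max_bytes:,} bytes"
            )
        chunks.append(chunk)
    return b"".join(chunks)


# Usage with an in-memory stand-in for the object-store response body.
wire_bytes = read_capped(io.BytesIO(b"x" * 10), max_bytes=16)
```

With this shape the memory overshoot is bounded by `chunk_size` rather than by the size of whatever the client uploaded.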
1c. Route layer: sanity-bound client-supplied counts
musehub/musehub/api/routes/wire.py — push_unpack_bundle:
    _MAX_COMMITS_PER_PUSH = 50_000
    _MAX_OBJECTS_PER_PUSH = 500_000
    if commits_count > _MAX_COMMITS_PER_PUSH or objects_count > _MAX_OBJECTS_PER_PUSH:
        raise HTTPException(status_code=422, detail="commits_count or objects_count exceeds limit")
These counts are logging inputs, not trusted values, but bounding them prevents absurd log lines and protects any future code that might use them for allocation.
1d. Move limits to config
All caps (_MAX_BUNDLE_BYTES, _MAX_COMMITS_PER_PUSH, _MAX_OBJECTS_PER_PUSH) should read from musehub/musehub/config.py so they can be tuned per environment (dev vs staging vs prod) without a code change.
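One way the config-backed caps could look is sketched below with a plain dataclass. The environment variable names and the `rejection_status` helper are assumptions for illustration; the real config.py may use a different settings mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BundleLimits:
    bundle_max_bytes: int = 512 * 1024 * 1024
    max_commits_per_push: int = 50_000
    max_objects_per_push: int = 500_000

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "BundleLimits":
        """Build limits from environment variables (hypothetical names)."""
        env = os.environ if env is None else env
        d = cls()
        return cls(
            bundle_max_bytes=int(env.get("MUSEHUB_BUNDLE_MAX_BYTES", d.bundle_max_bytes)),
            max_commits_per_push=int(env.get("MUSEHUB_MAX_COMMITS_PER_PUSH", d.max_commits_per_push)),
            max_objects_per_push=int(env.get("MUSEHUB_MAX_OBJECTS_PER_PUSH", d.max_objects_per_push)),
        )


def rejection_status(limits: BundleLimits, size_bytes: int,
                     commits_count: int, objects_count: int) -> Optional[int]:
    """Return the HTTP status to reject with, or None when the request is within limits."""
    if min(size_bytes, commits_count, objects_count) < 0:
        return 422  # negative values are never valid
    if size_bytes > limits.bundle_max_bytes:
        return 413
    if commits_count > limits.max_commits_per_push or objects_count > limits.max_objects_per_push:
        return 422
    return None


# Usage: a staging environment that lowers only the byte cap.
staging = BundleLimits.from_env({"MUSEHUB_BUNDLE_MAX_BYTES": "1000"})
```

Keeping the decision in one pure function means the route layer and the tests exercise the same thresholds.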
Acceptance criteria:
- POST /push/bundle-presign with size_bytes=600_000_000 returns 413
- POST /push/unpack-bundle where MinIO blob exceeds cap raises 422
- All caps live in config, not hardcoded
Phase 2 — Bundle content validation (background job)
These run inside the bundle.index background worker, not the sync path. Failures quarantine the bundle rather than crash the worker.
2a. Decompression size guard (zip bomb)
When the background job unpacks objects, track cumulative decompressed bytes:
    _MAX_DECOMPRESSED_BYTES = 4 * 1024 * 1024 * 1024  # 4 GB
    total_decompressed = 0
    for obj in raw_objects:
        content = obj["content"]
        encoding = obj.get("encoding", "")
        if encoding == "zstd":
            # NOTE: this inflates the whole object before the check below runs;
            # a single oversized object needs a bounded or streaming decompress.
            raw = zstd.decompress(content)
        else:
            raw = content
        total_decompressed += len(raw)
        if total_decompressed > _MAX_DECOMPRESSED_BYTES:
            raise BundleValidationError("decompressed size exceeds limit — possible zip bomb")
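To catch a bomb before it is fully inflated, the decompressor has to be fed incrementally and stopped once the output budget is spent. The sketch below shows the pattern using the standard library's zlib as a stand-in, purely so it runs without extra dependencies. The real worker would need the equivalent streaming API of its zstd binding, and `decompress_bounded` and `unpack_all` are hypothetical names.

```python
import zlib
from typing import List


class BundleValidationError(Exception):
    """Stand-in for the validation error type named in the plan."""


def decompress_bounded(data: bytes, budget: int, chunk_size: int = 64 * 1024) -> bytes:
    """Inflate zlib data, raising once the output exceeds `budget` bytes."""
    d = zlib.decompressobj()
    out = []
    produced = 0
    buf = data
    while not d.eof:
        # max_length caps how much output a single call may produce.
        chunk = d.decompress(buf, chunk_size)
        if not chunk and not d.unconsumed_tail and not d.eof:
            raise BundleValidationError("truncated compressed stream")
        produced += len(chunk)
        if produced > budget:
            raise BundleValidationError("decompressed size exceeds limit: possible zip bomb")
        out.append(chunk)
        buf = d.unconsumed_tail
    return b"".join(out)


def unpack_all(blobs: List[bytes], max_total: int) -> List[bytes]:
    """Decompress every blob against one shared cumulative budget."""
    remaining = max_total
    result = []
    for blob in blobs:
        raw = decompress_bounded(blob, remaining)
        remaining -= len(raw)
        result.append(raw)
    return result


# Usage: 10 MB of zeros compresses to a few KB and is rejected under a 1 MB budget.
bomb = zlib.compress(b"\0" * 10_000_000)
```

The overshoot past the budget is at most `chunk_size` bytes, regardless of how large the fully inflated payload would have been.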
2b. Actual object count verification
After unpacking, verify the actual counts match what the client declared:
    if abs(len(raw_objects) - declared_objects_count) > 10:  # small tolerance
        raise BundleValidationError(
            f"object count mismatch: declared {declared_objects_count}, actual {len(raw_objects)}"
        )
2c. Per-object sha256 verification
Every object in the bundle: sha256(content) == object_id. Content-addressing is the proof — verify it here before writing to MinIO.
    import hashlib
    for obj in raw_objects:
        oid = obj["object_id"]
        _, expected_hex = split_id(oid)
        # obj["content"] must be the decompressed bytes here (decode zstd first, as in 2a)
        actual_hex = hashlib.sha256(obj["content"]).hexdigest()
        if actual_hex != expected_hex:
            raise BundleValidationError(f"object {oid[:20]} content does not match declared id")
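A self-contained version of the check is sketched below. It assumes object ids have the form `sha256:<64 hex chars>`. The `split_id` shown here is a hypothetical stand-in for the real helper, whose exact behaviour is not given in this issue.

```python
import hashlib
from typing import Tuple

_HEX_DIGITS = set("0123456789abcdef")


class BundleValidationError(Exception):
    """Stand-in for the validation error type named in the plan."""


def split_id(object_id: str) -> Tuple[str, str]:
    """Split 'sha256:<hex>' into (algorithm, hex digest), rejecting malformed ids."""
    algo, sep, hex_digest = object_id.partition(":")
    hex_digest = hex_digest.lower()
    if not sep or algo != "sha256" or len(hex_digest) != 64 or not set(hex_digest) <= _HEX_DIGITS:
        raise BundleValidationError(f"malformed object id: {object_id[:20]!r}")
    return algo, hex_digest


def verify_object(object_id: str, content: bytes) -> None:
    """Raise unless sha256(content) matches the digest embedded in object_id."""
    _, expected_hex = split_id(object_id)
    actual_hex = hashlib.sha256(content).hexdigest()
    if actual_hex != expected_hex:
        raise BundleValidationError(
            f"object {object_id[:20]} content does not match declared id"
        )


# Usage: an id derived from the content verifies; any other content does not.
good_id = "sha256:" + hashlib.sha256(b"hello").hexdigest()
verify_object(good_id, b"hello")
```

Validating the id format first keeps malformed input from reaching the comparison at all.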
2d. Quarantine state
Add a quarantine boolean column to musehub_bundles (or equivalent tracking table). On any validation failure:
- Mark bundle as quarantined
- Move blob to a non-public MinIO bucket (muse-quarantine)
- Log the failure with full detail
- Do NOT advance the branch pointer or write objects to the main bucket
- Notify gabriel (email or webhook) — configurable
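The steps above can be sketched with in-memory stand-ins. `FakeStore`, `index_bundle`, and the `muse-bundles` bucket name are hypothetical; only `muse-quarantine` and `muse-objects` come from this plan. The point is the ordering: nothing reaches the main bucket or the branch pointer unless validation returns normally.

```python
from typing import Callable, Dict


class BundleValidationError(Exception):
    """Stand-in for the validation error type named in the plan."""


class FakeStore:
    """In-memory stand-in for the object store, one dict per bucket."""

    def __init__(self) -> None:
        self.buckets: Dict[str, Dict[str, bytes]] = {
            "muse-bundles": {}, "muse-objects": {}, "muse-quarantine": {},
        }

    def move(self, src: str, dst: str, key: str) -> None:
        self.buckets[dst][key] = self.buckets[src].pop(key)


def index_bundle(store: FakeStore, bundle_key: str,
                 validate: Callable[[bytes], Dict[str, bytes]],
                 branch: Dict[str, str], new_head: str,
                 notify: Callable[[str], None]) -> str:
    """Validate first; write objects and advance the branch only on success."""
    blob = store.buckets["muse-bundles"][bundle_key]
    try:
        objects = validate(blob)
    except BundleValidationError as exc:
        store.move("muse-bundles", "muse-quarantine", bundle_key)
        notify(f"bundle {bundle_key} quarantined: {exc}")
        return "quarantined"
    for oid, content in objects.items():
        store.buckets["muse-objects"][oid] = content
    branch["head"] = new_head
    return "indexed"


# Usage: a validator that always rejects leaves the branch untouched.
def reject(blob: bytes) -> Dict[str, bytes]:
    raise BundleValidationError("hash mismatch")

store = FakeStore()
store.buckets["muse-bundles"]["b1"] = b"bad"
branch = {"head": "old"}
alerts = []
status = index_bundle(store, "b1", reject, branch, "new", alerts.append)
```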
Acceptance criteria:
- A crafted bundle with mismatched object sha256 is quarantined, not indexed
- A zip bomb bundle is caught before full decompression
- Branch pointer is never advanced for a quarantined bundle
Phase 3 — Content scanning
This is the legally load-bearing phase. Required before any public launch that allows arbitrary user content.
3a. Known-hash blocklist
Maintain a musehub_blocked_hashes table (object_id text PRIMARY KEY). Before writing any object to MinIO, check this list. Seed it from:
- NCMEC hash lists (CSAM)
- Project VIC
- Internal DMCA takedown history
    blocked = await session.execute(
        select(db.MusehubBlockedHash.object_id)
        .where(db.MusehubBlockedHash.object_id.in_(all_object_ids))
    )
    if blocked.scalars().all():
        raise BundleValidationError("bundle contains blocked content")
Each object costs one indexed primary-key lookup, so the check adds negligible overhead.
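With a push cap of 500,000 objects, a single IN list may exceed the bind-parameter limit of some databases or drivers, so batching the lookup is one option. The sketch below uses the standard library's sqlite3 as a stand-in for the async SQLAlchemy session in the snippet above. `find_blocked` and the chunk size are assumptions.

```python
import sqlite3
from typing import Iterable, List


def find_blocked(conn: sqlite3.Connection, object_ids: Iterable[str],
                 chunk_size: int = 500) -> List[str]:
    """Return every id from object_ids that appears in musehub_blocked_hashes."""
    ids = list(object_ids)
    blocked: List[str] = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # Only the "?" placeholders are interpolated; the ids are bound parameters.
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT object_id FROM musehub_blocked_hashes WHERE object_id IN ({placeholders})",
            chunk,
        )
        blocked.extend(row[0] for row in rows)
    return blocked


# Usage against an in-memory table with one blocked id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE musehub_blocked_hashes (object_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO musehub_blocked_hashes VALUES (?)", ("sha256:bad",))
all_ids = [f"sha256:{i}" for i in range(1200)] + ["sha256:bad"]
hits = find_blocked(conn, all_ids)
```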
3b. CSAM scanning integration
For image/binary objects above a size threshold, enqueue a secondary scan job that calls an external CSAM API (Microsoft PhotoDNA or equivalent). Objects stay in quarantine bucket until scan clears.
Architecture:
    bundle.index job
      → writes objects to muse-quarantine (not muse-objects)
      → enqueues content.scan job per binary object
    content.scan job
      → calls CSAM API
      → on PASS: moves object from muse-quarantine → muse-objects
      → on FAIL: keeps in quarantine, fires alert, suspends repo
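The content.scan half of that flow can be sketched as a small handler with the scanner injected as a callable, so a stub can stand in until a real API is chosen. `handle_content_scan`, `Verdict`, and the in-memory buckets are hypothetical illustrations, not the worker's actual interface.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Set


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


def handle_content_scan(object_id: str, repo_id: str,
                        buckets: Dict[str, Dict[str, bytes]],
                        scan: Callable[[bytes], Verdict],
                        suspended_repos: Set[str],
                        alerts: List[str]) -> Optional[Verdict]:
    """Release a quarantined object on PASS; hold it and suspend the repo on FAIL."""
    content = buckets["muse-quarantine"].get(object_id)
    if content is None:
        return None  # already released or removed; nothing to do on a retry
    verdict = scan(content)
    if verdict is Verdict.PASS:
        buckets["muse-objects"][object_id] = buckets["muse-quarantine"].pop(object_id)
    else:
        suspended_repos.add(repo_id)
        alerts.append(f"content scan failed for {object_id} in repo {repo_id}")
    return verdict


# Usage with a stub scanner that flags one known payload.
buckets = {"muse-quarantine": {"o1": b"fine", "o2": b"flagged"}, "muse-objects": {}}
suspended: Set[str] = set()
alerts: List[str] = []
stub = lambda content: Verdict.FAIL if content == b"flagged" else Verdict.PASS
first = handle_content_scan("o1", "repo-a", buckets, stub, suspended, alerts)
second = handle_content_scan("o2", "repo-b", buckets, stub, suspended, alerts)
```

Returning `None` when the object is no longer in quarantine keeps a retried job from raising on work that already completed.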
3c. DMCA takedown infrastructure
Add POST /admin/takedown endpoint that:
- Adds object_id(s) to musehub_blocked_hashes
- Moves existing MinIO objects to quarantine
- Marks affected repos with a dmca_hold flag
- Returns a takedown receipt
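A minimal sketch of the takedown steps follows, using sets and dicts in place of the database and MinIO. `apply_takedown` and the receipt fields are illustrative assumptions. In particular, counting distinct submitted ids as `blocked_count` is a guess at what the receipt should report.

```python
from typing import Dict, Iterable, Set


def apply_takedown(object_ids: Iterable[str], repo_ids: Iterable[str],
                   blocked_hashes: Set[str],
                   buckets: Dict[str, Dict[str, bytes]],
                   held_repos: Set[str]) -> Dict[str, int]:
    """Block ids, quarantine any stored copies, hold repos, and return a receipt."""
    unique_ids = set(object_ids)
    unique_repos = set(repo_ids)
    blocked_hashes.update(unique_ids)  # set semantics make this an idempotent upsert
    quarantined = 0
    for oid in unique_ids:
        if oid in buckets["muse-objects"]:
            buckets["muse-quarantine"][oid] = buckets["muse-objects"].pop(oid)
            quarantined += 1
    held_repos.update(unique_repos)
    return {
        "blocked_count": len(unique_ids),
        "repos_held": len(unique_repos),
        "quarantined_count": quarantined,
    }


# Usage: one of the two reported objects is currently stored.
blocked: Set[str] = set()
held: Set[str] = set()
buckets = {"muse-objects": {"o1": b"data"}, "muse-quarantine": {}}
receipt = apply_takedown(["o1", "o2"], ["repo-a"], blocked, buckets, held)
repeat = apply_takedown(["o1", "o2"], ["repo-a"], blocked, buckets, held)
```

Running the same request twice changes nothing on the second pass except that `quarantined_count` drops to zero.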
Acceptance criteria:
- An object matching a blocked hash is rejected at index time
- DMCA takedown endpoint removes object from public access within one request
- CSAM scan failure suspends the repo and fires an alert
Phase 4 — Rate limiting and abuse detection
4a. Per-user bundle rate limits
In musehub/musehub/api/routes/wire.py, tighten WIRE_PUSH_LIMIT for the bundle endpoints specifically. Track per-user daily bundle upload bytes in Redis (or pg) and reject when exceeded.
4b. Anomaly detection
Background job: compare current push volume (bytes, objects, commits) against the user's 30-day rolling average. If 10x above average, flag for review rather than auto-reject.
4c. Client-side hints (the one Muse CLI piece)
muse/muse/cli/commands/push.py — before presigning, check bundle size against the server's declared limit (returned in a /caps endpoint or in the presign error). Print a clear message:
    ❌ Bundle too large: 600 MB exceeds server limit of 512 MB.
    Split your push into smaller increments or contact [email protected].
This is UX, not enforcement. The server still enforces in Phase 1.
Implementation order
Phase 1 is one sitting and should ship immediately — it closes the most obvious DoS vector with near-zero risk.
Phase 2 ships with the bundle.index background worker (already planned).
Phase 3 is gated on legal review of which CSAM API to use and which jurisdictions require it. Do not launch public signups without at least 3a (known-hash blocklist).
Phase 4 is operational hardening — do it after the platform has real traffic data to set sensible thresholds.
Files touched
| Phase | File |
|---|---|
| 1a, 1c | musehub/musehub/api/routes/wire.py |
| 1b | musehub/musehub/services/musehub_wire.py |
| 1d | musehub/musehub/config.py |
| 2a–2d | musehub/musehub/workers/bundle_index.py (new) |
| 2d | musehub/musehub/db/musehub_models.py (quarantine column) |
| 3a | musehub/musehub/db/musehub_models.py (blocked hashes table) |
| 3b | musehub/musehub/workers/content_scan.py (new) |
| 3c | musehub/musehub/api/routes/admin.py |
| 4c | muse/muse/cli/commands/push.py |
Phase 2 complete — bundle content validation
All 8 tests passing in test_bundle_validation_phase2.py.
What was added
BundleValidationError — new terminal exception class; quarantined bundles are never retried.
Three validation checks in process_bundle_index_job, all running before any DB or MinIO writes:
| Check | Issue ref | What it catches |
|---|---|---|
| Object count mismatch | 2b | abs(actual - declared_objects_count) > 10 |
| Decompression size guard | 2a | cumulative_decompressed > bundle_max_decompressed_bytes (default 4 GB) |
| Per-object sha256 | 2c | sha256(decompressed_content) != object_id |
Quarantine mechanics:
- quarantine_reason column on MusehubBackgroundJob (nullable text)
- quarantine_job(session, job_id, reason) in musehub_jobs — sets status='quarantined', done_at, quarantine_reason
- process_bundle_index_job marks the job row status='quarantined' in-session before raising; caller commits to persist
- quarantine_bundle(bundle_key) on BlobBackend — copies bundle from bundles/ to quarantine/ prefix, deletes original (best-effort)
Config: bundle_max_decompressed_bytes: int = 4 * 1024 * 1024 * 1024 (4 GB default)
Job payload: declared_objects_count and declared_commits_count now stored in the bundle.index job payload for the count mismatch check.
Tests (8 new)
- test_hash_mismatch_raises_bundle_validation_error
- test_zip_bomb_raises_bundle_validation_error
- test_count_mismatch_raises_bundle_validation_error
- test_validation_failure_writes_no_objects_to_minio — spies on backend.put; zero calls on validation failure
- test_validation_failure_writes_no_commits_to_db
- test_quarantine_job_sets_status_and_reason
- test_process_bundle_index_job_sets_quarantined_on_validation_error — catch + commit → status='quarantined'
- test_valid_bundle_passes_all_validation_checks — regression: clean bundle still indexes normally
Phase 3 (content scanning / blocked-hash lookup) remains open.
Phase 3 complete — content scanning and DMCA takedown
All 10 tests passing in test_bundle_content_scanning_phase3.py. 30/30 total across all bundle phases.
3a — Known-hash blocklist
MusehubBlockedHash table (object_id PK, reason, added_by, added_at). Seeded by DMCA takedown or manual import from NCMEC/Project VIC lists.
Check runs in process_bundle_index_job after Phase 2 validation, before any MinIO writes — a single IN query against an indexed table:
    rows = await session.execute(
        select(MusehubBlockedHash.object_id).where(MusehubBlockedHash.object_id.in_(all_oids))
    )
    blocked = rows.scalars().all()
    if blocked:
        raise BundleValidationError(f"bundle contains {len(blocked)} blocked object(s)...")
Error reports all matched IDs. Job status → quarantined.
3b — content.scan job infrastructure
After every successful bundle.index, a content.scan job is enqueued per indexed object (payload={object_id, repo_id, bundle_key}). The processor stub always returns clean — a real CSAM API (PhotoDNA, etc.) drops in via the CSAM_API_URL config once legal review completes. No architectural changes needed at integration time.
3c — DMCA takedown
POST /api/admin/takedown (admin-only, 403 for non-admin):
| Input | Effect |
|---|---|
| object_ids: [sha256:...] | Upserted into musehub_blocked_hashes (idempotent) |
| Objects already in MinIO | Moved to quarantine/ prefix via quarantine_object() |
| repo_ids: [sha256:...] | dmca_hold=True set on each repo |
Response: {blocked_count, repos_held, quarantined_count}
dmca_hold column added to musehub_repos — push gates and serve paths can enforce it.
Tests (10 new)
- test_blocked_object_quarantines_bundle
- test_blocked_check_fires_before_minio_puts — spy on backend.put, zero calls
- test_multiple_blocked_objects_all_reported
- test_clean_bundle_bypasses_blocklist_unimpeded — regression
- test_dmca_takedown_adds_hashes_to_blocklist
- test_dmca_takedown_marks_repos_dmca_hold
- test_dmca_takedown_quarantines_existing_minio_objects
- test_dmca_takedown_requires_admin
- test_dmca_takedown_idempotent
- test_content_scan_jobs_enqueued_after_indexing
Phase 4 (rate limiting / anomaly detection) remains open.
Phase 4 complete ✅
Per-user daily byte limits, anomaly detection, and /api/caps — 15 tests, all green.
What was built
4a — Per-user daily byte limit
- New table musehub_daily_push_bytes (identity_id, date, bytes_uploaded) — upserted at bundle-presign time via ON CONFLICT DO UPDATE … + size_bytes
- bundle-presign checks the running daily total before issuing a URL; responds 429 with "daily upload limit … reached; try again tomorrow" when the user is at or over bundle_daily_upload_limit_bytes
- New config setting bundle_daily_upload_limit_bytes (default 50 GB); set to 0 to disable
- Limit is per-user — exhausting one user's quota never blocks another
- record_bundle_bytes_uploaded(session, identity_id, size_bytes) — reusable service helper
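The quota check and upsert described above can be sketched with the standard library's sqlite3 standing in for the production database. `try_reserve` and the exact schema are assumptions; the at-or-over check and the "0 disables" rule follow the bullets above.

```python
import sqlite3
from datetime import date

SCHEMA = """
CREATE TABLE musehub_daily_push_bytes (
    identity_id    TEXT    NOT NULL,
    date           TEXT    NOT NULL,
    bytes_uploaded INTEGER NOT NULL,
    PRIMARY KEY (identity_id, date)
)
"""


def bytes_uploaded_today(conn: sqlite3.Connection, identity_id: str, today: date) -> int:
    row = conn.execute(
        "SELECT bytes_uploaded FROM musehub_daily_push_bytes WHERE identity_id = ? AND date = ?",
        (identity_id, today.isoformat()),
    ).fetchone()
    return row[0] if row else 0


def try_reserve(conn: sqlite3.Connection, identity_id: str, size_bytes: int,
                limit: int, today: date) -> bool:
    """Record the upload and return True, or return False when the caller should send 429."""
    if limit and bytes_uploaded_today(conn, identity_id, today) >= limit:
        return False  # user is at or over the daily limit; limit == 0 disables the check
    conn.execute(
        "INSERT INTO musehub_daily_push_bytes (identity_id, date, bytes_uploaded) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (identity_id, date) "
        "DO UPDATE SET bytes_uploaded = bytes_uploaded + excluded.bytes_uploaded",
        (identity_id, today.isoformat(), size_bytes),
    )
    return True


# Usage: a 100-byte daily limit for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
day = date(2024, 1, 1)
results = [try_reserve(conn, "alice", 60, 100, day) for _ in range(3)]
```

Because the check is "at or over" before the upsert, a user can exceed the limit by at most one bundle, which is already capped by the per-bundle size gate.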
4b — Anomaly detection
- New table musehub_push_anomalies (anomaly_id, identity_id, bytes_today, rolling_avg_bytes, ratio, detected_at)
- check_push_anomaly(session, identity_id, bytes_today) — computes 30-day rolling average from musehub_daily_push_bytes; if today's upload is >10× the average, inserts a musehub_push_anomaly row and logs a structured WARNING
- Push is never rejected — detection is advisory only
- No history → no flag (first-ever push is always clean)
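The flagging rule reduces to a small pure function, sketched below. `anomaly_ratio` is a hypothetical name, and whether idle days count as zeros in the 30-day average is not stated above. This sketch averages only the daily totals it is given.

```python
from statistics import mean
from typing import Optional, Sequence


def anomaly_ratio(history: Sequence[int], bytes_today: int,
                  threshold: float = 10.0) -> Optional[float]:
    """Return today's ratio to the rolling average when it exceeds threshold, else None.

    `history` holds the user's prior daily byte totals from the last 30 days.
    """
    if not history:
        return None  # no history: a first-ever push is never flagged
    average = mean(history)
    if average <= 0:
        return None  # nothing meaningful to compare against
    ratio = bytes_today / average
    return ratio if ratio > threshold else None


# Usage: 2,000 bytes today against a 100-byte average is 20x and gets flagged.
flagged = anomaly_ratio([100, 100], 2_000)
```

The caller would insert the anomaly row and log the warning when the result is not `None`; the push itself proceeds either way.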
4c — GET /api/caps
- New public endpoint musehub/api/routes/api/caps.py
- Returns { max_bundle_bytes, daily_upload_limit_bytes, max_commits_per_push, max_objects_per_push }
- No auth required — Muse CLI calls this as a pre-flight hint
Tests
tests/test_bundle_rate_limiting_phase4.py — 15 tests, all passing alongside the 18 from phases 2 and 3.
Phase 1 complete — size gates shipped
All three gates are live on dev.
What shipped
Gate 1 — presign rejects oversized bundles (413)
POST /push/bundle-presign with size_bytes > bundle_max_bytes returns 413 before a presigned URL is ever issued. Client never gets a URL to abuse.
Gate 2 — unpack-bundle rejects oversized wire bytes (422)
After the MinIO GET, len(wire_bytes) > bundle_max_bytes raises 422. Defends against a client that bypassed presign and PUT directly to MinIO.
Gate 3 — count bounds on client-supplied values (422)
commits_count or objects_count above cap → 422 at the route layer, before MinIO is touched.
All caps in Settings (config.py) — tunable per environment.
TDD: tests/test_bundle_size_gates.py — 6 tests, all green.
Phase 2 is next — bundle content validation (background job)
Phase 2 is blocked on the bundle.index background worker, which is not yet implemented. That worker is the next milestone (it is also required for fetch/clone/pull to work correctly). Phase 2 gates will be implemented TDD inside the worker as part of that milestone.
Phase 2 checklist when the worker lands:
- len(raw_objects) within tolerance of objects_count declared at presign time
- sha256(content) == object_id for every object before writing to MinIO
- muse-quarantine bucket, does not advance branch pointer, fires alert