gabriel / musehub public
Closed #49 Enhancement security
filed by gabriel human · 40 days ago

Bundle push security: size gates, content validation, and abuse defense

Which repo

MuseHub, not Muse. Security enforcement must be server-side — client-side checks are UX, not security. Any caller hitting the API directly bypasses the CLI entirely. All four phases live in musehub except minor client hints in Phase 4.


Attack surface summary

The bundle push path has four exploitable properties today:

  1. No size cap — presign accepts any size_bytes. A 50 GB bundle uploads successfully.
  2. Content is opaque until background job runs — illegal content lands in MinIO before any scan.
  3. Client-supplied counts are trusted: commits_count=1 with a 10M-object bundle passes the sync path and blows up the indexer.
  4. No decompression guard — a 47 MB zstd bundle that expands to 4 GB crashes the background worker.

Phase 1 — Size gates (one sitting, zero risk)

These are the cheapest guards with the highest leverage. All changes in musehub/.

1a. Presign endpoint: reject oversized bundles before issuing a URL

musehub/musehub/api/routes/wire.py, in push_bundle_presign:

_MAX_BUNDLE_BYTES = 512 * 1024 * 1024  # 512 MB — tune per tier

if size_bytes > _MAX_BUNDLE_BYTES:
    raise HTTPException(
        status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
        detail=f"bundle size {size_bytes:,} bytes exceeds limit {_MAX_BUNDLE_BYTES:,}",
    )

No DB access needed. Fires before the presigned URL is issued, so the client never gets a URL to abuse.

1b. Unpack-bundle: verify wire size after MinIO GET

musehub/musehub/services/musehub_wire.py, in wire_push_unpack_bundle, after the MinIO GET:

if len(wire_bytes) > _MAX_BUNDLE_BYTES:
    raise ValueError(
        f"bundle {bundle_key[:20]} exceeds size limit: {len(wire_bytes):,} bytes"
    )

Defends against a client that bypasses the presign check and PUTs directly to MinIO.

1c. Route layer: sanity-bound client-supplied counts

musehub/musehub/api/routes/wire.py, in push_unpack_bundle:

_MAX_COMMITS_PER_PUSH  = 50_000
_MAX_OBJECTS_PER_PUSH  = 500_000

if commits_count > _MAX_COMMITS_PER_PUSH or objects_count > _MAX_OBJECTS_PER_PUSH:
    raise HTTPException(status_code=422, detail="commits_count or objects_count exceeds limit")

These are logging inputs, not trusted counts — but bounding them prevents absurd log lines and future code that might use them for allocation.

1d. Move limits to config

All caps (_MAX_BUNDLE_BYTES, _MAX_COMMITS_PER_PUSH, _MAX_OBJECTS_PER_PUSH) should read from musehub/musehub/config.py so they can be tuned per environment (dev vs staging vs prod) without a code change.
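
A minimal sketch of what environment-driven caps could look like, using only the standard library. The BundleLimits class, the load_bundle_limits helper, and the MUSEHUB_* environment variable names are hypothetical; the real config.py may use a different settings mechanism. The defaults mirror the values proposed in 1a and 1c.

```python
import os
from dataclasses import dataclass


def _env_int(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default


@dataclass(frozen=True)
class BundleLimits:
    """Hypothetical container for the three caps named in this issue."""
    max_bundle_bytes: int
    max_commits_per_push: int
    max_objects_per_push: int


def load_bundle_limits() -> BundleLimits:
    # Env var names are assumptions for illustration only.
    return BundleLimits(
        max_bundle_bytes=_env_int("MUSEHUB_MAX_BUNDLE_BYTES", 512 * 1024 * 1024),
        max_commits_per_push=_env_int("MUSEHUB_MAX_COMMITS_PER_PUSH", 50_000),
        max_objects_per_push=_env_int("MUSEHUB_MAX_OBJECTS_PER_PUSH", 500_000),
    )
```

With this shape, a staging environment could lower the byte cap by exporting one variable, and the route code would read limits.max_bundle_bytes instead of a module constant.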

Acceptance criteria:

  • POST /push/bundle-presign with size_bytes=600_000_000 returns 413
  • POST /push/unpack-bundle where MinIO blob exceeds cap raises 422
  • All caps live in config, not hardcoded

Phase 2 — Bundle content validation (background job)

These run inside the bundle.index background worker, not the sync path. Failures quarantine the bundle rather than crash the worker.

2a. Decompression size guard (zip bomb)

When the background job unpacks objects, track cumulative decompressed bytes:

_MAX_DECOMPRESSED_BYTES = 4 * 1024 * 1024 * 1024  # 4 GB

total_decompressed = 0
for obj in raw_objects:
    content = obj["content"]
    encoding = obj.get("encoding", "")
    if encoding == "zstd":
        raw = zstd.decompress(content)
    else:
        raw = content
    total_decompressed += len(raw)
    if total_decompressed > _MAX_DECOMPRESSED_BYTES:
        raise BundleValidationError("decompressed size exceeds limit — possible zip bomb")
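
One caveat with the loop above: each object is decompressed in full before its length is added to the counter, so a single hostile object can still expand without bound before the check fires. The sketch below shows the bounded, chunked alternative. It uses zlib from the standard library purely as a stand-in for zstd, and bounded_decompress is a hypothetical helper; a streaming zstd decompressor would take the place of zlib.decompressobj in the real worker.

```python
import zlib


class BundleValidationError(Exception):
    """Stand-in for the exception class named in this issue."""


def bounded_decompress(blob: bytes, budget: int, chunk: int = 64 * 1024) -> bytes:
    """Decompress blob in chunks, aborting once output exceeds budget bytes.

    At most budget + chunk bytes are ever held in memory.
    """
    decomp = zlib.decompressobj()
    out = bytearray()
    data = blob
    while data:
        # The second argument caps how many bytes this call may emit.
        out += decomp.decompress(data, chunk)
        if len(out) > budget:
            raise BundleValidationError(
                "decompressed size exceeds limit, possible zip bomb"
            )
        # Input that was not processed yet is fed back in on the next pass.
        data = decomp.unconsumed_tail
    out += decomp.flush()
    if len(out) > budget:
        raise BundleValidationError(
            "decompressed size exceeds limit, possible zip bomb"
        )
    return bytes(out)
```

To keep the cumulative semantics of 2a, the caller would pass the remaining budget (limit minus bytes already decompressed) for each object rather than the full limit.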

2b. Actual object count verification

After unpacking, verify the actual counts match what the client declared:

if abs(len(raw_objects) - declared_objects_count) > 10:  # small tolerance
    raise BundleValidationError(
        f"object count mismatch: declared {declared_objects_count}, actual {len(raw_objects)}"
    )

2c. Per-object sha256 verification

Every object in the bundle: sha256(content) == object_id. Content-addressing is the proof — verify it here before writing to MinIO.

for obj in raw_objects:
    oid = obj["object_id"]
    _, expected_hex = split_id(oid)
    actual_hex = hashlib.sha256(obj["content"]).hexdigest()
    if actual_hex != expected_hex:
        raise BundleValidationError(f"object {oid[:20]} content does not match declared id")

2d. Quarantine state

Add a quarantine boolean column to musehub_bundles (or equivalent tracking table). On any validation failure:

  1. Mark bundle as quarantined
  2. Move blob to a non-public MinIO bucket (muse-quarantine)
  3. Log the failure with full detail
  4. Do NOT advance the branch pointer or write objects to the main bucket
  5. Notify gabriel (email or webhook) — configurable

Acceptance criteria:

  • A crafted bundle with mismatched object sha256 is quarantined, not indexed
  • A zip bomb bundle is caught before full decompression
  • Branch pointer is never advanced for a quarantined bundle

Phase 3 — Content scanning

This is the legally load-bearing phase. Required before any public launch that allows arbitrary user content.

3a. Known-hash blocklist

Maintain a musehub_blocked_hashes table (object_id text PRIMARY KEY). Before writing any object to MinIO, check this list. Seed it from:

  • NCMEC hash lists (CSAM)
  • Project VIC
  • Internal DMCA takedown history

blocked = await session.execute(
    select(db.MusehubBlockedHash.object_id)
    .where(db.MusehubBlockedHash.object_id.in_(all_object_ids))
)
if blocked.scalars().all():
    raise BundleValidationError("bundle contains blocked content")

One primary-key index lookup per object, batched into a single IN query, so the cost is negligible.

3b. CSAM scanning integration

For image/binary objects above a size threshold, enqueue a secondary scan job that calls an external CSAM API (Microsoft PhotoDNA or equivalent). Objects stay in quarantine bucket until scan clears.

Architecture:

bundle.index job
  → writes objects to muse-quarantine (not muse-objects)
  → enqueues content.scan job per binary object

content.scan job
  → calls CSAM API
  → on PASS: moves object from muse-quarantine → muse-objects
  → on FAIL: keeps in quarantine, fires alert, suspends repo
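
The two-job flow above can be sketched with in-memory buckets. Everything here is illustrative: the Buckets class, index_bundle, content_scan, and the scanner callable are hypothetical stand-ins for the MinIO client, the job processors, and the external CSAM API.

```python
from typing import Callable, Dict, List, Set

QUARANTINE = "muse-quarantine"
PUBLIC = "muse-objects"


class Buckets:
    """In-memory stand-in for the two MinIO buckets."""

    def __init__(self) -> None:
        self.data: Dict[str, Dict[str, bytes]] = {QUARANTINE: {}, PUBLIC: {}}

    def put(self, bucket: str, key: str, blob: bytes) -> None:
        self.data[bucket][key] = blob

    def move(self, src: str, dst: str, key: str) -> None:
        self.data[dst][key] = self.data[src].pop(key)


def index_bundle(buckets: Buckets, objects: Dict[str, bytes]) -> List[str]:
    """bundle.index: land every object in quarantine, return scan jobs to enqueue."""
    jobs = []
    for object_id, blob in objects.items():
        buckets.put(QUARANTINE, object_id, blob)
        jobs.append(object_id)
    return jobs


def content_scan(
    buckets: Buckets,
    object_id: str,
    repo_id: str,
    scanner: Callable[[bytes], bool],
    suspended_repos: Set[str],
) -> bool:
    """content.scan: promote on PASS, keep quarantined and suspend repo on FAIL."""
    blob = buckets.data[QUARANTINE][object_id]
    if scanner(blob):
        buckets.move(QUARANTINE, PUBLIC, object_id)
        return True
    suspended_repos.add(repo_id)  # alerting is omitted from this sketch
    return False
```

The property worth preserving in the real implementation is that nothing reaches muse-objects except through a passing scan.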

3c. DMCA takedown infrastructure

Add POST /admin/takedown endpoint that:

  1. Adds object_id(s) to musehub_blocked_hashes
  2. Moves existing MinIO objects to quarantine
  3. Marks affected repos with a dmca_hold flag
  4. Returns a takedown receipt

Acceptance criteria:

  • An object matching a blocked hash is rejected at index time
  • DMCA takedown endpoint removes object from public access within one request
  • CSAM scan failure suspends the repo and fires an alert

Phase 4 — Rate limiting and abuse detection

4a. Per-user bundle rate limits

In musehub/musehub/api/routes/wire.py, tighten WIRE_PUSH_LIMIT for the bundle endpoints specifically. Track per-user daily bundle upload bytes in Redis (or pg) and reject when exceeded.

4b. Anomaly detection

Background job: compare current push volume (bytes, objects, commits) against the user's 30-day rolling average. If 10x above average, flag for review rather than auto-reject.

4c. Client-side hints (the one Muse CLI piece)

muse/muse/cli/commands/push.py — before presigning, check bundle size against the server's declared limit (returned in a /caps endpoint or in the presign error). Print a clear message:

❌ Bundle too large: 600 MB exceeds server limit of 512 MB.
   Split your push into smaller increments or contact [email protected].

This is UX, not enforcement. The server still enforces in Phase 1.
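
A rough sketch of the client-side check, assuming the server's limit arrives as a dict with a max_bundle_bytes key. The preflight_bundle_size function name and the response shape are assumptions for illustration.

```python
from typing import Optional

MIB = 1024 * 1024


def preflight_bundle_size(size_bytes: int, caps: dict) -> Optional[str]:
    """Return a user-facing error message, or None if the push may proceed.

    A missing or zero limit is treated as "no hint available", because the
    server remains the only real enforcement point.
    """
    limit = caps.get("max_bundle_bytes")
    if not limit or size_bytes <= limit:
        return None
    return (
        f"Bundle too large: {size_bytes // MIB} MB exceeds "
        f"server limit of {limit // MIB} MB."
    )
```

The CLI would print the returned message and exit before requesting a presigned URL. When the hint is unavailable it simply proceeds and lets the 413 from Phase 1 speak.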


Implementation order

Phase 1 is one sitting and should ship immediately — it closes the most obvious DoS vector with near-zero risk.

Phase 2 ships with the bundle.index background worker (already planned).

Phase 3 is gated on legal review of which CSAM API to use and which jurisdictions require it. Do not launch public signups without at least 3a (known-hash blocklist).

Phase 4 is operational hardening — do it after the platform has real traffic data to set sensible thresholds.


Files touched

Phase    File
1a, 1c   musehub/musehub/api/routes/wire.py
1b       musehub/musehub/services/musehub_wire.py
1d       musehub/musehub/config.py
2a–2d    musehub/musehub/workers/bundle_index.py (new)
2d       musehub/musehub/db/musehub_models.py (quarantine column)
3a       musehub/musehub/db/musehub_models.py (blocked hashes table)
3b       musehub/musehub/workers/content_scan.py (new)
3c       musehub/musehub/api/routes/admin.py
4c       muse/muse/cli/commands/push.py

Activity (4)
gabriel opened this issue 40 days ago
gabriel 40 days ago

Phase 1 complete — size gates shipped

All three gates are live on dev.

What shipped

Gate 1: presign rejects oversized bundles (413)
POST /push/bundle-presign with size_bytes > bundle_max_bytes returns 413 before a presigned URL is ever issued, so the client never gets a URL to abuse.

Gate 2: unpack-bundle rejects oversized wire bytes (422)
After the MinIO GET, len(wire_bytes) > bundle_max_bytes returns 422. This defends against a client that bypassed presign and PUT directly to MinIO.

Gate 3: count bounds on client-supplied values (422)
commits_count or objects_count above its cap returns 422 at the route layer, before MinIO is touched.

All caps in Settings (config.py) — tunable per environment:

bundle_max_bytes   = 512 MB
bundle_max_commits = 100k
bundle_max_objects = 1M

TDD: tests/test_bundle_size_gates.py — 6 tests, all green.


Phase 2 is next — bundle content validation (background job)

Phase 2 is blocked on the bundle.index background worker, which is not yet implemented. That worker is the next milestone (it is also required for fetch/clone/pull to work correctly). Phase 2 gates will be implemented TDD inside the worker as part of that milestone.

Phase 2 checklist when the worker lands:

  • Decompression size guard — track cumulative decompressed bytes; abort if > 4 GB (zip bomb defense)
  • Actual object count verification — assert len(raw_objects) within tolerance of objects_count declared at presign time
  • Per-object sha256 verification — sha256(content) == object_id for every object before writing to MinIO
  • Quarantine state — validation failure moves bundle to muse-quarantine bucket, does not advance branch pointer, fires alert
gabriel 40 days ago

Phase 2 complete — bundle content validation

All 8 tests passing in test_bundle_validation_phase2.py.

What was added

BundleValidationError — new terminal exception class; quarantined bundles are never retried.

Three validation checks in process_bundle_index_job, all running before any DB or MinIO writes:

Check                      Issue ref   What it catches
Object count mismatch      2b          abs(actual - declared_objects_count) > 10
Decompression size guard   2a          cumulative_decompressed > bundle_max_decompressed_bytes (default 4 GB)
Per-object sha256          2c          sha256(decompressed_content) != object_id

Quarantine mechanics:

  • quarantine_reason column on MusehubBackgroundJob (nullable text)
  • quarantine_job(session, job_id, reason) in musehub_jobs — sets status='quarantined', done_at, quarantine_reason
  • process_bundle_index_job marks the job row status='quarantined' in-session before raising; caller commits to persist
  • quarantine_bundle(bundle_key) on BlobBackend — copies bundle from bundles/ to quarantine/ prefix, deletes original (best-effort)
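
The copy-then-delete behaviour of quarantine_bundle can be illustrated with a dict-backed store. This is a sketch under assumptions: InMemoryBlobBackend and its delete method are hypothetical, and the real BlobBackend talks to MinIO rather than a dict.

```python
import logging
from typing import Dict

log = logging.getLogger("quarantine_sketch")


class InMemoryBlobBackend:
    """Dict-backed stand-in for the real BlobBackend."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def delete(self, key: str) -> None:
        del self.blobs[key]

    def quarantine_bundle(self, bundle_key: str) -> str:
        """Copy bundles/<name> to quarantine/<name>, then delete the original."""
        prefix = "bundles/"
        if not bundle_key.startswith(prefix):
            raise ValueError(f"not a bundle key: {bundle_key!r}")
        target = "quarantine/" + bundle_key[len(prefix):]
        # Copy first, so a failed delete never loses the evidence.
        self.blobs[target] = self.blobs[bundle_key]
        try:
            self.delete(bundle_key)
        except Exception:
            # Best-effort: the quarantine copy exists; the stray original
            # can be cleaned up later.
            log.warning("could not delete %s after quarantine copy", bundle_key)
        return target
```

Ordering the copy before the delete means the worst failure mode is a duplicate blob, never a missing one.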

Config: bundle_max_decompressed_bytes: int = 4 * 1024 * 1024 * 1024 (4 GB default)

Job payload: declared_objects_count and declared_commits_count now stored in the bundle.index job payload for the count mismatch check.

Tests (8 new)

  • test_hash_mismatch_raises_bundle_validation_error
  • test_zip_bomb_raises_bundle_validation_error
  • test_count_mismatch_raises_bundle_validation_error
  • test_validation_failure_writes_no_objects_to_minio — spies on backend.put; zero calls on validation failure
  • test_validation_failure_writes_no_commits_to_db
  • test_quarantine_job_sets_status_and_reason
  • test_process_bundle_index_job_sets_quarantined_on_validation_error — catch + commit → status='quarantined'
  • test_valid_bundle_passes_all_validation_checks — regression: clean bundle still indexes normally

Phase 3 (content scanning / blocked-hash lookup) remains open.

gabriel 40 days ago

Phase 3 complete — content scanning and DMCA takedown

All 10 tests passing in test_bundle_content_scanning_phase3.py. 30/30 total across all bundle phases.

3a — Known-hash blocklist

MusehubBlockedHash table (object_id PK, reason, added_by, added_at). Seeded by DMCA takedown or manual import from NCMEC/Project VIC lists.

Check runs in process_bundle_index_job after Phase 2 validation, before any MinIO writes — a single IN query against an indexed table:

blocked = (await session.execute(
    select(MusehubBlockedHash.object_id).where(MusehubBlockedHash.object_id.in_(all_oids))
)).scalars().all()
if blocked:
    raise BundleValidationError(f"bundle contains {len(blocked)} blocked object(s)...")

Error reports all matched IDs. Job status → quarantined.

3b — content.scan job infrastructure

After every successful bundle.index, a content.scan job is enqueued per indexed object (payload={object_id, repo_id, bundle_key}). The processor stub always returns clean — a real CSAM API (PhotoDNA, etc.) drops in via the CSAM_API_URL config once legal review completes. No architectural changes needed at integration time.

3c — DMCA takedown

POST /api/admin/takedown (admin-only, 403 for non-admin):

Input                      Effect
object_ids: [sha256:...]   Upserted into musehub_blocked_hashes (idempotent)
Objects already in MinIO   Moved to quarantine/ prefix via quarantine_object()
repo_ids: [sha256:...]     dmca_hold=True set on each repo

Response: {blocked_count, repos_held, quarantined_count}
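
The takedown semantics can be modelled as a pure function over in-memory state. HubState and takedown are hypothetical names, and the counting rules are assumptions: here blocked_count and repos_held reflect the size of the request, while quarantined_count reflects objects actually moved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class HubState:
    """In-memory stand-in for the blocklist table, object store, and repo flags."""
    blocked: Dict[str, str] = field(default_factory=dict)        # object_id -> reason
    public: Dict[str, bytes] = field(default_factory=dict)       # served objects
    quarantine: Dict[str, bytes] = field(default_factory=dict)   # withheld objects
    dmca_hold: Set[str] = field(default_factory=set)             # repo_ids on hold


def takedown(state: HubState, object_ids: List[str], repo_ids: List[str],
             reason: str = "dmca") -> Dict[str, int]:
    quarantined = 0
    for object_id in object_ids:
        # Upsert: repeating a takedown overwrites the same key, so it is idempotent.
        state.blocked[object_id] = reason
        if object_id in state.public:
            state.quarantine[object_id] = state.public.pop(object_id)
            quarantined += 1
    state.dmca_hold.update(repo_ids)
    return {
        "blocked_count": len(object_ids),
        "repos_held": len(repo_ids),
        "quarantined_count": quarantined,
    }
```

Running the same request twice leaves the state unchanged on the second call; only quarantined_count differs, because there is nothing left to move.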

dmca_hold column added to musehub_repos — push gates and serve paths can enforce it.

Tests (10 new)

  • test_blocked_object_quarantines_bundle
  • test_blocked_check_fires_before_minio_puts — spy on backend.put, zero calls
  • test_multiple_blocked_objects_all_reported
  • test_clean_bundle_bypasses_blocklist_unimpeded — regression
  • test_dmca_takedown_adds_hashes_to_blocklist
  • test_dmca_takedown_marks_repos_dmca_hold
  • test_dmca_takedown_quarantines_existing_minio_objects
  • test_dmca_takedown_requires_admin
  • test_dmca_takedown_idempotent
  • test_content_scan_jobs_enqueued_after_indexing

Phase 4 (rate limiting / anomaly detection) remains open.

gabriel 40 days ago

Phase 4 complete ✅

Per-user daily byte limits, anomaly detection, and /api/caps — 15 tests, all green.

What was built

4a — Per-user daily byte limit

  • New table musehub_daily_push_bytes (identity_id, date, bytes_uploaded) — upserted at bundle-presign time via ON CONFLICT DO UPDATE … + size_bytes
  • bundle-presign checks the running daily total before issuing a URL; responds 429 with "daily upload limit … reached; try again tomorrow" when the user is at or over bundle_daily_upload_limit_bytes
  • New config setting bundle_daily_upload_limit_bytes (default 50 GB); set to 0 to disable
  • Limit is per-user — exhausting one user's quota never blocks another
  • record_bundle_bytes_uploaded(session, identity_id, size_bytes) — reusable service helper
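
A small sketch of the check-then-upsert sequence, using an in-memory SQLite database as a stand-in for the real one. The try_presign and bytes_today helpers are hypothetical; the table and column names follow the description above, and the return value is an HTTP-style status code rather than a real response.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE musehub_daily_push_bytes (
               identity_id    TEXT    NOT NULL,
               date           TEXT    NOT NULL,
               bytes_uploaded INTEGER NOT NULL,
               PRIMARY KEY (identity_id, date)
           )"""
    )
    return conn


def bytes_today(conn: sqlite3.Connection, identity_id: str, day: str) -> int:
    row = conn.execute(
        "SELECT bytes_uploaded FROM musehub_daily_push_bytes "
        "WHERE identity_id = ? AND date = ?",
        (identity_id, day),
    ).fetchone()
    return row[0] if row else 0


def try_presign(conn: sqlite3.Connection, identity_id: str, size_bytes: int,
                limit: int, day: str) -> int:
    """Return 429 if the user is at or over the limit, else record and return 200."""
    # A limit of 0 disables the check, matching the config description.
    if limit > 0 and bytes_today(conn, identity_id, day) >= limit:
        return 429
    conn.execute(
        "INSERT INTO musehub_daily_push_bytes (identity_id, date, bytes_uploaded) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (identity_id, date) DO UPDATE "
        "SET bytes_uploaded = bytes_uploaded + excluded.bytes_uploaded",
        (identity_id, day, size_bytes),
    )
    return 200
```

Because the check looks at the running total before adding the new bundle, a user can overshoot the limit by at most one bundle, which the per-bundle size cap from Phase 1 keeps bounded.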

4b — Anomaly detection

  • New table musehub_push_anomalies (anomaly_id, identity_id, bytes_today, rolling_avg_bytes, ratio, detected_at)
  • check_push_anomaly(session, identity_id, bytes_today): computes the 30-day rolling average from musehub_daily_push_bytes; if today's upload is more than 10× the average, inserts a musehub_push_anomalies row and logs a structured WARNING
  • Push is never rejected — detection is advisory only
  • No history → no flag (first-ever push is always clean)
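
The detection rule itself is a few lines of arithmetic. The sketch below isolates it from the database. The anomaly_ratio function is a hypothetical helper, and averaging over the days that have a row (rather than over all 30 calendar days) is an assumption.

```python
from typing import Optional, Sequence


def anomaly_ratio(history_bytes: Sequence[int], bytes_today: int,
                  threshold: float = 10.0, window_days: int = 30) -> Optional[float]:
    """Return today's ratio to the rolling average if it exceeds threshold.

    history_bytes holds one total per prior day, oldest first. Returns None
    when there is no usable history or the ratio is within the threshold.
    """
    window = list(history_bytes)[-window_days:]
    if not window:
        return None  # first-ever push is never flagged
    average = sum(window) / len(window)
    if average <= 0:
        return None
    ratio = bytes_today / average
    return ratio if ratio > threshold else None
```

For example, a user who normally pushes 100 bytes a day is flagged at 2000 bytes (ratio 20.0) but not at exactly 1000 bytes, since the comparison is strictly greater than 10×.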

4c — GET /api/caps

  • New public endpoint musehub/api/routes/api/caps.py
  • Returns { max_bundle_bytes, daily_upload_limit_bytes, max_commits_per_push, max_objects_per_push }
  • No auth required — Muse CLI calls this as a pre-flight hint

Tests

tests/test_bundle_rate_limiting_phase4.py — 15 tests, all passing alongside the 18 from phases 2 and 3.