Closed #49 Enhancement security

filed by gabriel human · 40 days ago

Bundle push security: size gates, content validation, and abuse defense

Which repo

MuseHub, not Muse. Security enforcement must be server-side — client-side checks are UX, not security. Any caller hitting the API directly bypasses the CLI entirely. All four phases live in musehub except minor client hints in Phase 4.

Attack surface summary

The bundle push path has four exploitable properties today:

No size cap — presign accepts any size_bytes. A 50 GB bundle uploads successfully.
Content is opaque until background job runs — illegal content lands in MinIO before any scan.
Client-supplied counts are trusted — commits_count=1 with a 10M-object bundle passes the sync path and blows up the indexer.
No decompression guard — a 47 MB zstd bundle that expands to 4 GB crashes the background worker.

Phase 1 — Size gates (one sitting, zero risk)

These are the cheapest guards with the highest leverage. All changes in musehub/.

1a. Presign endpoint: reject oversized bundles before issuing a URL

musehub/musehub/api/routes/wire.py — push_bundle_presign:

_MAX_BUNDLE_BYTES = 512 * 1024 * 1024  # 512 MB — tune per tier

if size_bytes > _MAX_BUNDLE_BYTES:
    raise HTTPException(
        status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
        detail=f"bundle size {size_bytes:,} bytes exceeds limit {_MAX_BUNDLE_BYTES:,}",
    )

No DB access needed. Fires before the presigned URL is issued, so the client never gets a URL to abuse.

1b. Unpack-bundle: verify wire size after MinIO GET

musehub/musehub/services/musehub_wire.py — wire_push_unpack_bundle, after the MinIO GET:

if len(wire_bytes) > _MAX_BUNDLE_BYTES:
    raise ValueError(
        f"bundle {bundle_key[:20]} exceeds size limit: {len(wire_bytes):,} bytes"
    )

Defends against a client that bypasses the presign check and PUTs directly to MinIO.

1c. Route layer: sanity-bound client-supplied counts

musehub/musehub/api/routes/wire.py — push_unpack_bundle:

_MAX_COMMITS_PER_PUSH  = 50_000
_MAX_OBJECTS_PER_PUSH  = 500_000

if commits_count > _MAX_COMMITS_PER_PUSH or objects_count > _MAX_OBJECTS_PER_PUSH:
    raise HTTPException(status_code=422, detail="commits_count or objects_count exceeds limit")

These values feed logging only and are never trusted as counts, but bounding them keeps log lines sane and protects any future code that might size an allocation from them.

1d. Move limits to config

All caps (_MAX_BUNDLE_BYTES, _MAX_COMMITS_PER_PUSH, _MAX_OBJECTS_PER_PUSH) should read from musehub/musehub/config.py so they can be tuned per environment (dev vs staging vs prod) without a code change.
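A minimal sketch of what environment-driven caps could look like, using only the standard library. The environment variable names (`MUSEHUB_BUNDLE_MAX_*`) and the `BundleLimits` class are hypothetical; the real `config.py` may use a different settings mechanism entirely. The defaults mirror the values proposed above.

```python
import os
from dataclasses import dataclass

MIB = 1024 * 1024


def _env_int(name: str, default: int) -> int:
    """Read a non-negative integer from the environment, else use the default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # ValueError on garbage, so a bad deploy fails at startup
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


@dataclass(frozen=True)
class BundleLimits:
    """Hypothetical container for the three Phase 1 caps."""
    bundle_max_bytes: int
    bundle_max_commits: int
    bundle_max_objects: int

    @classmethod
    def from_env(cls) -> "BundleLimits":
        # Variable names are assumptions made for this sketch.
        return cls(
            bundle_max_bytes=_env_int("MUSEHUB_BUNDLE_MAX_BYTES", 512 * MIB),
            bundle_max_commits=_env_int("MUSEHUB_BUNDLE_MAX_COMMITS", 50_000),
            bundle_max_objects=_env_int("MUSEHUB_BUNDLE_MAX_OBJECTS", 500_000),
        )
```

Route handlers would then compare against `limits.bundle_max_bytes` instead of a module-level constant, so dev, staging, and prod can differ without a code change.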

Acceptance criteria:

POST /push/bundle-presign with size_bytes=600_000_000 returns 413
POST /push/unpack-bundle where MinIO blob exceeds cap raises 422
All caps live in config, not hardcoded

Phase 2 — Bundle content validation (background job)

These run inside the bundle.index background worker, not the sync path. Failures quarantine the bundle rather than crash the worker.

2a. Decompression size guard (zip bomb)

When the background job unpacks objects, track cumulative decompressed bytes:

_MAX_DECOMPRESSED_BYTES = 4 * 1024 * 1024 * 1024  # 4 GB

total_decompressed = 0
for obj in raw_objects:
    content = obj["content"]
    encoding = obj.get("encoding", "")
    if encoding == "zstd":
        raw = zstd.decompress(content)
    else:
        raw = content
    total_decompressed += len(raw)
    if total_decompressed > _MAX_DECOMPRESSED_BYTES:
        raise BundleValidationError("decompressed size exceeds limit — possible zip bomb")
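One gap worth noting: the loop above checks the running total only after each object has been fully decompressed, so a single highly compressed object could still exhaust memory before the check fires. The sketch below shows one way to bound the output of each decompression call by the remaining budget. It uses the standard library's `zlib` as a stand-in for zstd so that it runs without third-party packages; the function names are hypothetical, and the real worker would need the equivalent bounded-output call from whichever zstd binding it uses.

```python
import zlib
from typing import Iterable, List, Mapping


class BundleValidationError(Exception):
    """Stand-in for the exception named in the issue."""


def bounded_decompress(data: bytes, budget: int) -> bytes:
    """Inflate `data`, refusing to produce more than `budget` bytes."""
    inflater = zlib.decompressobj()
    # Request one byte past the budget: getting it back proves an overflow
    # without ever materialising the rest of the stream.
    out = inflater.decompress(data, budget + 1)
    if len(out) > budget:
        raise BundleValidationError("decompressed size exceeds limit")
    if not inflater.eof:
        raise BundleValidationError("compressed stream is truncated")
    return out


def unpack_within_budget(raw_objects: Iterable[Mapping], max_total: int) -> List[bytes]:
    """Decode every object while keeping cumulative output within max_total."""
    remaining = max_total
    decoded: List[bytes] = []
    for obj in raw_objects:
        content = obj["content"]
        if obj.get("encoding", "") == "zlib":  # "zstd" in the real bundle format
            raw = bounded_decompress(content, remaining)
        else:
            if len(content) > remaining:
                raise BundleValidationError("decompressed size exceeds limit")
            raw = content
        remaining -= len(raw)
        decoded.append(raw)
    return decoded
```

Because each call is capped at the remaining budget, peak memory stays near the configured limit regardless of how extreme the compression ratio of any single object is.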

2b. Actual object count verification

After unpacking, verify the actual counts match what the client declared:

if abs(len(raw_objects) - declared_objects_count) > 10:  # small tolerance
    raise BundleValidationError(
        f"object count mismatch: declared {declared_objects_count}, actual {len(raw_objects)}"
    )

2c. Per-object sha256 verification

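For every object in the bundle, the sha256 of the decompressed content must equal the digest in object_id. Content-addressing is the proof; verify it here before writing to MinIO.

for obj in raw_objects:
    oid = obj["object_id"]
    _, expected_hex = split_id(oid)
    # Hash the decompressed bytes: the ID addresses the content, not its wire encoding.
    # In practice, reuse the bytes already decompressed in 2a rather than decompressing twice.
    raw = zstd.decompress(obj["content"]) if obj.get("encoding", "") == "zstd" else obj["content"]
    actual_hex = hashlib.sha256(raw).hexdigest()
    if actual_hex != expected_hex:
        raise BundleValidationError(f"object {oid[:20]} content does not match declared id")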

2d. Quarantine state

Add a quarantine boolean column to musehub_bundles (or equivalent tracking table). On any validation failure:

Mark bundle as quarantined
Move blob to a non-public MinIO bucket (muse-quarantine)
Log the failure with full detail
Do NOT advance the branch pointer or write objects to the main bucket
Notify gabriel (email or webhook) — configurable

Acceptance criteria:

A crafted bundle with mismatched object sha256 is quarantined, not indexed
A zip bomb bundle is caught before full decompression
Branch pointer is never advanced for a quarantined bundle

Phase 3 — Content scanning

This is the legally load-bearing phase. Required before any public launch that allows arbitrary user content.

3a. Known-hash blocklist

Maintain a musehub_blocked_hashes table (object_id text PRIMARY KEY). Before writing any object to MinIO, check this list. Seed it from:

NCMEC hash lists (CSAM)
Project VIC
Internal DMCA takedown history

blocked = await session.execute(
    select(db.MusehubBlockedHash.object_id)
    .where(db.MusehubBlockedHash.object_id.in_(all_object_ids))
)
if blocked.scalars().all():
    raise BundleValidationError("bundle contains blocked content")

Each object ID is a single lookup against the primary-key index (O(log n) on a B-tree), so the cost is negligible.
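For very large bundles, the `IN` list may need to be split so it stays under the database's bind-parameter limit. The following is a rough sketch of that chunking, using the standard library's `sqlite3` in place of the async SQLAlchemy session shown above so it can run standalone. The `find_blocked` helper and the chunk size of 500 are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, List


def find_blocked(conn: sqlite3.Connection, object_ids: Iterable[str],
                 chunk_size: int = 500) -> List[str]:
    """Return every submitted object ID that appears in the blocklist."""
    ids = list(object_ids)
    blocked: List[str] = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One placeholder per ID keeps the query parameterised.
        placeholders = ",".join("?" for _ in chunk)
        rows = conn.execute(
            "SELECT object_id FROM musehub_blocked_hashes "
            f"WHERE object_id IN ({placeholders})",
            chunk,
        )
        blocked.extend(row[0] for row in rows)
    return blocked


# Tiny in-memory demonstration with a made-up object ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE musehub_blocked_hashes (object_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO musehub_blocked_hashes VALUES ('sha256:aaaa')")
hits = find_blocked(conn, ["sha256:aaaa", "sha256:bbbb"])
```

A non-empty result is the signal to raise `BundleValidationError` before any object is written.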

3b. CSAM scanning integration

For image/binary objects above a size threshold, enqueue a secondary scan job that calls an external CSAM API (Microsoft PhotoDNA or equivalent). Objects stay in the quarantine bucket until the scan clears them.

Architecture:

bundle.index job
  → writes objects to muse-quarantine (not muse-objects)
  → enqueues content.scan job per binary object

content.scan job
  → calls CSAM API
  → on PASS: moves object from muse-quarantine → muse-objects
  → on FAIL: keeps in quarantine, fires alert, suspends repo
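The two-stage flow above can be modelled with in-memory dictionaries standing in for the two MinIO buckets. This is only a sketch of the control flow: `Store`, `index_bundle`, `run_content_scan`, and the `scanner` callable are all hypothetical names, and the real jobs would talk to MinIO, the job queue, and an external API.

```python
from typing import Callable, Dict, List, Set


class Store:
    """In-memory stand-in for the muse-quarantine and muse-objects buckets."""

    def __init__(self) -> None:
        self.quarantine: Dict[str, bytes] = {}
        self.objects: Dict[str, bytes] = {}
        self.suspended_repos: Set[str] = set()
        self.alerts: List[str] = []


def index_bundle(store: Store, repo_id: str, objects: Dict[str, bytes]) -> List[dict]:
    """Write every object to quarantine and return one scan job per object."""
    jobs = []
    for object_id, content in objects.items():
        store.quarantine[object_id] = content
        jobs.append({"object_id": object_id, "repo_id": repo_id})
    return jobs


def run_content_scan(store: Store, job: dict,
                     scanner: Callable[[bytes], bool]) -> bool:
    """Promote the object on a clean scan; otherwise alert and suspend the repo."""
    object_id = job["object_id"]
    if scanner(store.quarantine[object_id]):
        store.objects[object_id] = store.quarantine.pop(object_id)
        return True
    store.alerts.append(f"scan failed for {object_id}")
    store.suspended_repos.add(job["repo_id"])
    return False
```

The design choice that matters is the default: an object is never publicly reachable until a scan has explicitly passed it, so a crashed or delayed scanner fails closed.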

3c. DMCA takedown infrastructure

Add POST /admin/takedown endpoint that:

Adds object_id(s) to musehub_blocked_hashes
Moves existing MinIO objects to quarantine
Marks affected repos with a dmca_hold flag
Returns a takedown receipt
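The four steps above can be sketched as one idempotent function over in-memory state. Everything here is illustrative: `HubState` and `apply_takedown` are hypothetical, the receipt's field names and counting rules are assumptions, and the real endpoint would operate on the database and MinIO.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class HubState:
    """In-memory stand-in for the blocklist table, buckets, and repo flags."""
    blocked_hashes: Set[str] = field(default_factory=set)
    public_objects: Dict[str, bytes] = field(default_factory=dict)
    quarantined_objects: Dict[str, bytes] = field(default_factory=dict)
    dmca_hold_repos: Set[str] = field(default_factory=set)


def apply_takedown(state: HubState, object_ids: Iterable[str],
                   repo_ids: Iterable[str]) -> dict:
    """Block the IDs, quarantine any stored copies, and hold the repos."""
    unique_ids = set(object_ids)
    unique_repos = set(repo_ids)
    quarantined = 0
    for object_id in unique_ids:
        state.blocked_hashes.add(object_id)  # set semantics make this an upsert
        if object_id in state.public_objects:
            state.quarantined_objects[object_id] = state.public_objects.pop(object_id)
            quarantined += 1
    state.dmca_hold_repos.update(unique_repos)
    # Assumed receipt shape: counts of IDs covered by this request.
    return {
        "blocked_count": len(unique_ids),
        "repos_held": len(unique_repos),
        "quarantined_count": quarantined,
    }
```

Replaying the same takedown leaves the state unchanged, which makes the request safe to retry after a timeout.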

Acceptance criteria:

An object matching a blocked hash is rejected at index time
DMCA takedown endpoint removes object from public access within one request
CSAM scan failure suspends the repo and fires an alert

Phase 4 — Rate limiting and abuse detection

4a. Per-user bundle rate limits

In musehub/musehub/api/routes/wire.py, tighten WIRE_PUSH_LIMIT for the bundle endpoints specifically. Track per-user daily bundle upload bytes in Redis (or Postgres) and reject uploads once the daily total is exceeded.

4b. Anomaly detection

Background job: compare current push volume (bytes, objects, commits) against the user's 30-day rolling average. If the current volume exceeds 10x the average, flag the push for review rather than rejecting it automatically.

4c. Client-side hints (the one Muse CLI piece)

muse/muse/cli/commands/push.py — before presigning, check bundle size against the server's declared limit (returned in a /caps endpoint or in the presign error). Print a clear message:

❌ Bundle too large: 600 MB exceeds server limit of 512 MB.
   Split your push into smaller increments or contact [email protected].

This is UX, not enforcement. The server still enforces in Phase 1.
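A minimal sketch of the pre-flight comparison, assuming the server's limit arrives as a dictionary with a `max_bundle_bytes` key. The `preflight_bundle_size` helper is hypothetical, and sizes are printed in bytes to sidestep MB versus MiB ambiguity.

```python
from typing import Mapping, Optional


def preflight_bundle_size(size_bytes: int, caps: Mapping[str, int]) -> Optional[str]:
    """Return a user-facing error message, or None if the push may proceed."""
    limit = caps.get("max_bundle_bytes")
    if not limit:
        # Server did not advertise a limit: let the server decide.
        return None
    if size_bytes <= limit:
        return None
    return (
        f"Bundle too large: {size_bytes:,} bytes exceeds "
        f"server limit of {limit:,} bytes."
    )
```

A missing or zero limit falls through to the server, which keeps the hint from blocking pushes when the CLI and server versions disagree.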

Implementation order

Phase 1 is one sitting and should ship immediately — it closes the most obvious DoS vector with near-zero risk.

Phase 2 ships with the bundle.index background worker (already planned).

Phase 3 is gated on legal review of which CSAM API to use and which jurisdictions require it. Do not launch public signups without at least 3a (known-hash blocklist).

Phase 4 is operational hardening — do it after the platform has real traffic data to set sensible thresholds.

Files touched

Phase	File
1a, 1c	`musehub/musehub/api/routes/wire.py`
1b	`musehub/musehub/services/musehub_wire.py`
1d	`musehub/musehub/config.py`
2a–2d	`musehub/musehub/workers/bundle_index.py` (new)
2d	`musehub/musehub/db/musehub_models.py` (quarantine column)
3a	`musehub/musehub/db/musehub_models.py` (blocked hashes table)
3b	`musehub/musehub/workers/content_scan.py` (new)
3c	`musehub/musehub/api/routes/admin.py`
4c	`muse/muse/cli/commands/push.py`

Activity (4)

gabriel opened this issue 40 days ago

gabriel 40 days ago

Phase 1 complete — size gates shipped

All three gates are live on dev.

What shipped

Gate 1: presign rejects oversized bundles (413). POST /push/bundle-presign with size_bytes > bundle_max_bytes returns 413 before a presigned URL is ever issued, so the client never gets a URL to abuse.

Gate 2: unpack-bundle rejects oversized wire bytes (422). After the MinIO GET, len(wire_bytes) > bundle_max_bytes raises 422. This defends against a client that bypassed presign and PUT directly to MinIO.

Gate 3: count bounds on client-supplied values (422). A commits_count or objects_count above its cap returns 422 at the route layer, before MinIO is touched.

All caps in Settings (config.py) — tunable per environment:

bundle_max_bytes   = 512 MB
bundle_max_commits = 100k
bundle_max_objects = 1M

TDD: tests/test_bundle_size_gates.py — 6 tests, all green.

Phase 2 is next — bundle content validation (background job)

Phase 2 is blocked on the bundle.index background worker, which is not yet implemented. That worker is the next milestone (it is also required for fetch/clone/pull to work correctly). Phase 2 gates will be implemented TDD inside the worker as part of that milestone.

Phase 2 checklist when the worker lands:

Decompression size guard — track cumulative decompressed bytes; abort if > 4 GB (zip bomb defense)
Actual object count verification — assert len(raw_objects) within tolerance of objects_count declared at presign time
Per-object sha256 verification — sha256(content) == object_id for every object before writing to MinIO
Quarantine state — validation failure moves bundle to muse-quarantine bucket, does not advance branch pointer, fires alert

gabriel 40 days ago

Phase 2 complete — bundle content validation

All 8 tests passing in test_bundle_validation_phase2.py.

What was added

BundleValidationError — new terminal exception class; quarantined bundles are never retried.

Three validation checks in process_bundle_index_job, all running before any DB or MinIO writes:

Check	Issue ref	What it catches
Object count mismatch	2b	`abs(actual - declared_objects_count) > 10`
Decompression size guard	2a	`cumulative_decompressed > bundle_max_decompressed_bytes` (default 4 GB)
Per-object sha256	2c	`sha256(decompressed_content) != object_id`

Quarantine mechanics:

quarantine_reason column on MusehubBackgroundJob (nullable text)
quarantine_job(session, job_id, reason) in musehub_jobs — sets status='quarantined', done_at, quarantine_reason
process_bundle_index_job marks the job row status='quarantined' in-session before raising; caller commits to persist
quarantine_bundle(bundle_key) on BlobBackend — copies bundle from bundles/ to quarantine/ prefix, deletes original (best-effort)

Config: bundle_max_decompressed_bytes: int = 4 * 1024 * 1024 * 1024 (4 GB default)

Job payload: declared_objects_count and declared_commits_count now stored in the bundle.index job payload for the count mismatch check.

Tests (8 new)

test_hash_mismatch_raises_bundle_validation_error
test_zip_bomb_raises_bundle_validation_error
test_count_mismatch_raises_bundle_validation_error
test_validation_failure_writes_no_objects_to_minio — spies on backend.put; zero calls on validation failure
test_validation_failure_writes_no_commits_to_db
test_quarantine_job_sets_status_and_reason
test_process_bundle_index_job_sets_quarantined_on_validation_error — catch + commit → status='quarantined'
test_valid_bundle_passes_all_validation_checks — regression: clean bundle still indexes normally

Phase 3 (content scanning / blocked-hash lookup) remains open.

gabriel 40 days ago

Phase 3 complete — content scanning and DMCA takedown

All 10 tests passing in test_bundle_content_scanning_phase3.py. 30/30 total across all bundle phases.

3a — Known-hash blocklist

MusehubBlockedHash table (object_id PK, reason, added_by, added_at). Seeded by DMCA takedown or manual import from NCMEC/Project VIC lists.

Check runs in process_bundle_index_job after Phase 2 validation, before any MinIO writes — a single IN query against an indexed table:

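result = await session.execute(
    select(MusehubBlockedHash.object_id).where(MusehubBlockedHash.object_id.in_(all_oids))
)
blocked = result.scalars().all()  # matched IDs; an un-executed select() is always truthy
if blocked:
    raise BundleValidationError(f"bundle contains {len(blocked)} blocked object(s)...")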

Error reports all matched IDs. Job status → quarantined.

3b — content.scan job infrastructure

After every successful bundle.index, a content.scan job is enqueued per indexed object (payload={object_id, repo_id, bundle_key}). The processor stub always returns clean — a real CSAM API (PhotoDNA, etc.) drops in via the CSAM_API_URL config once legal review completes. No architectural changes needed at integration time.

3c — DMCA takedown

POST /api/admin/takedown (admin-only, 403 for non-admin):

Input	Effect
`object_ids: [sha256:...]`	Upserted into `musehub_blocked_hashes` (idempotent)
Objects already in MinIO	Moved to `quarantine/` prefix via `quarantine_object()`
`repo_ids: [sha256:...]`	`dmca_hold=True` set on each repo

Response: {blocked_count, repos_held, quarantined_count}

dmca_hold column added to musehub_repos — push gates and serve paths can enforce it.

Tests (10 new)

test_blocked_object_quarantines_bundle
test_blocked_check_fires_before_minio_puts — spy on backend.put, zero calls
test_multiple_blocked_objects_all_reported
test_clean_bundle_bypasses_blocklist_unimpeded — regression
test_dmca_takedown_adds_hashes_to_blocklist
test_dmca_takedown_marks_repos_dmca_hold
test_dmca_takedown_quarantines_existing_minio_objects
test_dmca_takedown_requires_admin
test_dmca_takedown_idempotent
test_content_scan_jobs_enqueued_after_indexing

Phase 4 (rate limiting / anomaly detection) remains open.

gabriel 40 days ago

Phase 4 complete ✅

Per-user daily byte limits, anomaly detection, and /api/caps — 15 tests, all green.

What was built

4a — Per-user daily byte limit

New table musehub_daily_push_bytes (identity_id, date, bytes_uploaded) — upserted at bundle-presign time via ON CONFLICT DO UPDATE … + size_bytes
bundle-presign checks the running daily total before issuing a URL; responds 429 with "daily upload limit … reached; try again tomorrow" when the user is at or over bundle_daily_upload_limit_bytes
New config setting bundle_daily_upload_limit_bytes (default 50 GB); set to 0 to disable
Limit is per-user — exhausting one user's quota never blocks another
record_bundle_bytes_uploaded(session, identity_id, size_bytes) — reusable service helper
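The check-then-record behaviour described above can be sketched with a dictionary in place of the musehub_daily_push_bytes table. `DailyByteLedger` and `DailyQuotaExceeded` are hypothetical names. The sketch follows the stated rule that a request is refused only when the user is already at or over the limit, which means a single request may carry the total past the cap.

```python
from datetime import date
from typing import Dict, Tuple


class DailyQuotaExceeded(Exception):
    """Would map to an HTTP 429 response at the route layer."""


class DailyByteLedger:
    """In-memory stand-in for the (identity_id, date, bytes_uploaded) table."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes  # 0 disables the limit
        self._totals: Dict[Tuple[str, date], int] = {}

    def reserve(self, identity_id: str, size_bytes: int, today: date) -> int:
        """Check the running total, then add size_bytes and return the new total."""
        key = (identity_id, today)
        used = self._totals.get(key, 0)
        if self.limit_bytes > 0 and used >= self.limit_bytes:
            raise DailyQuotaExceeded("daily upload limit reached; try again tomorrow")
        self._totals[key] = used + size_bytes
        return self._totals[key]
```

Keying on `(identity_id, date)` is what keeps one user's exhausted quota from affecting anyone else and makes the counter reset naturally at the date boundary.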

4b — Anomaly detection

New table musehub_push_anomalies (anomaly_id, identity_id, bytes_today, rolling_avg_bytes, ratio, detected_at)
check_push_anomaly(session, identity_id, bytes_today): computes the 30-day rolling average from musehub_daily_push_bytes; if today's upload is more than 10× the average, inserts a row into musehub_push_anomalies and logs a structured WARNING
Push is never rejected — detection is advisory only
No history → no flag (first-ever push is always clean)
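The ratio test itself is small enough to show as a pure function. This sketch assumes the rolling average is taken over the days that have a recorded upload; whether the shipped `check_push_anomaly` averages over those days or over all 30 calendar days is not stated here. The `anomaly_ratio` name is hypothetical.

```python
from typing import Optional, Sequence


def anomaly_ratio(history_bytes: Sequence[int], bytes_today: int,
                  threshold: float = 10.0) -> Optional[float]:
    """Return today's ratio to the rolling average if it exceeds the threshold."""
    if not history_bytes:
        # No history: a first-ever push is never flagged.
        return None
    average = sum(history_bytes) / len(history_bytes)
    if average <= 0:
        return None
    ratio = bytes_today / average
    # Strictly greater than, so exactly 10x the average is not flagged.
    return ratio if ratio > threshold else None
```

Returning the ratio rather than a boolean lets the caller store it alongside the flag, and a `None` result means the push proceeds with nothing recorded.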

4c — GET /api/caps

New public endpoint musehub/api/routes/api/caps.py
Returns { max_bundle_bytes, daily_upload_limit_bytes, max_commits_per_push, max_objects_per_push }
No auth required — Muse CLI calls this as a pre-flight hint

Tests

tests/test_bundle_rate_limiting_phase4.py — 15 tests, all passing alongside the 18 from phases 2 and 3.

Assignee

gabriel human

Release

no commits linked to this issue
