Agent Coordination
Muse's coordination layer lets multiple agents work a shared codebase in parallel without stepping on each other. It provides file-backed work queues with POSIX-atomic claiming, advisory symbol reservations with heartbeat keep-alive, a dependency DAG for ordering, import-graph sharding for zero-interference parallelism, and a three-pass conflict forecast before a single line of code changes.
.muse/coordination/ as content-addressed JSON. Remote sync to MuseHub
is optional; everything works offline.
Overview
Two independent primitives compose the system:
| Primitive | Purpose | Atomicity |
|---|---|---|
| Task queue | Distribute discrete units of work across agents. Producer enqueues; consumers race to claim; winner completes or fails. | open(O_CREAT|O_EXCL) — POSIX-guaranteed single winner |
| Reservations | Advisory symbol-address leases. Agents declare intent before touching a symbol; muse coord forecast detects overlaps before they happen. |
write_text_atomic() (atomic rename); content-addressed ID prevents duplicates |
A typical multi-agent pipeline:
# Orchestrator shards the repo into N parallel zones muse coord shard --agents 4 --json # Orchestrator enqueues one task per shard muse coord enqueue "Refactor shard 1" --queue refactor \ --payload '{"files":["src/billing.py","src/models.py"]}' \ --run-id orchestrator --json # Each worker claims one task muse coord claim --queue refactor --run-id agent-1 --json # Worker reserves the symbols it will touch muse coord reserve "src/billing.py::compute_total" \ --run-id agent-1 --op modify --ttl 7200 --json # Forecast before any writes muse coord forecast --json # Work … commit … then release and complete muse coord release <reservation_id> --run-id agent-1 --json muse coord complete <task_id> --run-id agent-1 --json
Multi-agent scenario — orchestrator + 3 workers
This scenario shows a complete pipeline: one orchestrator shards the repo, enqueues three tasks, three workers race to claim them, a forecast catches a symbol overlap before any code changes, and the orchestrator collects all results.
Step 1 — shard and enqueue
# 1. Split the import graph into 3 parallel-safe zones
muse coord shard --agents 3 --json \
| python3 -c "import sys,json; [print(s['shard'], s['files']) for s in json.load(sys.stdin)['shards']]"
# 1 ['src/billing.py', 'src/models.py']
# 2 ['src/auth.py', 'src/tokens.py']
# 3 ['src/api.py', 'src/middleware.py']
# 2. Enqueue one task per shard
muse coord enqueue "Refactor shard 1" --queue refactor --priority 10 \
--payload '{"files":["src/billing.py","src/models.py"]}' \
--run-id orchestrator --json
muse coord enqueue "Refactor shard 2" --queue refactor --priority 10 \
--payload '{"files":["src/auth.py","src/tokens.py"]}' \
--run-id orchestrator --json
muse coord enqueue "Refactor shard 3" --queue refactor --priority 10 \
--payload '{"files":["src/api.py","src/middleware.py"]}' \
--run-id orchestrator --json
Step 2 — workers race to claim
# Each worker claims from the refactor queue (POSIX-atomic, one winner per task)
TASK=$(muse coord claim --queue refactor --run-id agent-1 --json)
TASK_ID=$(echo "$TASK" | python3 -c "import sys,json; print(json.load(sys.stdin)['task_id'])")
FILES=$(echo "$TASK" | python3 -c "import sys,json; print(json.load(sys.stdin)['payload']['files'])")
{
"task_id": "sha256:4e8b2f1a3c7d...",
"title": "Refactor shard 1",
"queue": "refactor",
"claimer_run_id": "agent-1",
"claimed_at": "2026-04-21T16:00:00Z",
"expires_at": "2026-04-21T17:00:00Z",
"payload": { "files": ["src/billing.py", "src/models.py"] }
}
Step 3 — reserve symbols, run forecast
# agent-1: billing shard
muse coord reserve "src/billing.py::compute_total" \
--run-id agent-1 --op modify --ttl 7200 --json
# agent-2: auth shard — also touches compute_total (import chain)
muse coord reserve "src/billing.py::compute_total" \
--run-id agent-2 --op rename --ttl 7200 --json
# orchestrator: forecast before any agent writes code
muse coord forecast --json
{
"active_reservations": 3,
"call_graph_available": true,
"partial_forecast": false,
"conflicts": [
{
"conflict_type": "address_overlap",
"addresses": ["src/billing.py::compute_total"],
"agents": ["agent-1@feat/refactor", "agent-2@feat/auth"],
"confidence": 1.0,
"description": "Symbol reserved by 2 agents on different branches"
},
{
"conflict_type": "operation_conflict",
"addresses": ["src/billing.py::compute_total"],
"agents": ["agent-1@feat/refactor", "agent-2@feat/auth"],
"confidence": 0.9,
"description": "Incompatible operations: modify vs rename on same symbol"
}
],
"high_risk": 2,
"medium_risk": 0,
"low_risk": 0
}
Orchestrator sees the conflict and serializes the work: agent-1 renames first,
agent-2's reservation is updated to depends-on agent-1's:
# Agent-2 releases its reservation and re-reserves with depends-on
muse coord release <agent-2-reservation-id> --run-id agent-2 --json
muse coord reserve "src/billing.py::compute_total" \
--run-id agent-2 --op modify --ttl 7200 \
--depends-on <agent-1-reservation-id> --json
Step 4 — work, commit, release, complete
# ... agent does work on its branch ...
muse commit -m "refactor: shard 1 billing module" \
--agent-id claude-code --model-id claude-sonnet-4-6 --sign
# Release all reservations for this run
muse coord release --all-for-run agent-1 --run-id agent-1 --json
# Complete the task with a structured result
muse coord complete sha256:4e8b2f1a3c7d... \
--run-id agent-1 \
--result '{"symbols_modified": 12, "tests_passing": true}' \
--json
Step 5 — orchestrator collects results
muse coord tasks --queue refactor --status completed --json \
| python3 -c "
import sys, json
data = json.load(sys.stdin)
for t in data['tasks']:
print(t['title'], '→', t['result'])
"
Refactor shard 1 → {'symbols_modified': 12, 'tests_passing': True}
Refactor shard 2 → {'symbols_modified': 8, 'tests_passing': True}
Refactor shard 3 → {'symbols_modified': 5, 'tests_passing': True}
Work queues
Tasks are content-addressed by their inputs — same title + queue + payload + priority
+ creator always produces the same task_id. Queues are named strings;
any alphanumeric-plus-hyphen-or-underscore name up to 64 chars is valid.
TaskRecord fields
| Field | Type | Description |
|---|---|---|
task_id | str (sha256:…) | Content-addressed; same inputs → same ID (idempotent enqueue) |
title | str ≤256 | Short human-readable description of the work unit |
payload | dict | Arbitrary JSON — agent-specific data passed to the claimer |
priority | int | Higher = claimed first; ties broken by created_at FIFO |
queue | str ≤64 | Queue name [a-zA-Z0-9_-]+; default "default" |
created_at | datetime (UTC) | Enqueue timestamp |
created_by | str ≤256 | run_id of the enqueuing agent |
ttl_seconds | int | Seconds from created_at before pending task expires; default 86400 (24h) |
tags | list[str] | Up to 32 tags, each ≤64 chars; for filtering and grouping |
muse coord enqueue "Lint billing module" \ --queue lint \ --priority 10 \ --payload '{"file": "src/billing.py"}' \ --ttl 3600 \ --tags billing python \ --run-id orch --json
Claim lifecycle
Task state is derived from two files: the task record (immutable) and the claim record
(mutable). The claim file is created with open(O_CREAT|O_EXCL) — POSIX
guarantees exactly one agent wins the race; all others get FileExistsError.
pending ──[O_CREAT|O_EXCL, one winner]──→ claimed
│
┌───────────────────────────┼────────────────┐
↓ ↓ ↓
completed failed cancelled
(result set) (error set)
↑ Re-claim path (timed_out):
claimed ──[expires_at ≤ now]──→ timed_out ──[write_atomic + nonce]──→ claimed
ClaimRecord fields
| Field | Type | Description |
|---|---|---|
task_id | str (sha256:…) | Links to the immutable TaskRecord |
claimer_run_id | str ≤256 | Agent that claimed the task |
claimed_at | datetime (UTC) | Initial claim timestamp |
expires_at | datetime (UTC) | Claim TTL; default 3600s; extended by heartbeat |
status | str | "claimed" "completed" "failed" "cancelled" "timed_out" |
heartbeat_at | datetime (UTC) | Last heartbeat timestamp |
claim_nonce | str | Optimistic concurrency token for re-claim race resolution |
result | dict | null | Set by muse coord complete --result JSON |
error | str | null | Set by muse coord fail-task --error MSG |
Claiming and re-claiming
muse coord claim scans the queue sorted by priority descending, then
created_at ascending (FIFO within a priority band). It tries each candidate in
order:
- Pending task —
open(O_CREAT|O_EXCL). POSIX-atomic. First agent to call wins; rest loop to the next candidate. - Timed-out task —
write_text_atomicwith a freshclaim_nonce, then read back. If the nonce matches, the agent won the re-claim race. If not, another agent beat it; loop to the next candidate.
muse coord claim --queue refactor --run-id agent-1 --json muse coord complete <task_id> --run-id agent-1 \ --result '{"symbols_renamed": 7}' --json muse coord fail-task <task_id> --run-id agent-1 \ --error "AST parse failed on line 42" --json muse coord cancel-task <task_id> --run-id orch --json muse coord tasks --queue refactor --status pending --json
claimer_run_id may call complete, fail-task,
or heartbeat on a claim. Orchestrators use cancel-task.
Reservations
A reservation is an advisory symbol-address lease. It does not prevent writes
(no lock server required) — it signals intent so muse coord forecast
can detect overlaps before they produce merge conflicts.
ReservationRecord fields
| Field | Type | Description |
|---|---|---|
reservation_id | str (sha256:…) | Content-addressed: SHA-256(run_id, branch, sorted(addresses), operation) |
run_id | str ≤256 | Agent identifier — use your pipeline ID |
branch | str | Branch this agent is working on |
addresses | list[str] | Symbol addresses; fnmatch glob patterns supported ("billing.py::*") |
created_at | datetime (UTC) | Reservation creation time |
expires_at | datetime (UTC) | Hard TTL expiry; extended by heartbeat |
operation | str | null | "modify" "rename" "delete" "extract" "move" — used by Tier 3 forecast |
muse coord reserve
twice with the same arguments returns the same ID and writes a single file. Idempotent.
# reserve one symbol muse coord reserve "src/billing.py::compute_total" \ --run-id agent-1 --op modify --ttl 7200 --json # reserve an entire file with a glob muse coord reserve "src/billing.py::*" \ --run-id agent-1 --op rename --json # release when done muse coord release <reservation_id> --run-id agent-1 --json # release everything for a run (end of pipeline) muse coord release --all-for-run agent-1 --run-id agent-1 --json
Active-reservation check
A reservation is active when all of the following hold:
- No release tombstone exists for
reservation_id now < max(expires_at, heartbeat.extended_expires_at)
Once a release tombstone is written it is checked before full JSON parsing — the stem-only scan makes active-check O(n) in file count, not file size.
Heartbeat — TTL extension
Long-running agents extend their reservations and claim TTLs with heartbeats.
Each heartbeat atomically rewrites a single file with a new extended_expires_at.
| Field | Description |
|---|---|
reservation_id | Which reservation this keep-alive extends |
run_id | Must match the reservation's run_id |
last_beat_at | This heartbeat's timestamp |
extended_expires_at | now + extension_seconds — new effective expiry |
# extend reservation TTL muse coord heartbeat <reservation_id> --run-id agent-1 --json # extend task claim TTL muse coord heartbeat <task_id> --run-id agent-1 --json # recommended poll interval: extension_seconds / 2 # default extension: 3600s → heartbeat every ~1800s
Reservation TTL defaults: 3600s. Claim TTL defaults: 3600s. Task pending TTL defaults: 86400s (24h). Extension range: 1s – 31,536,000s (1 year).
Dependency DAG
Reservations can declare dependencies on other reservations — "I cannot start until agent-B's rename is complete." The DAG enforces ordering without a central scheduler. Acyclicity is checked at write time; cycles are structurally impossible on disk.
Three-step dependency chain
A rename must happen before modify, which must happen before the test update.
Each reservation declares its predecessor with --depends-on:
# Step 1 — agent-A renames the symbol (no dependency)
RENAME_ID=$(muse coord reserve "src/billing.py::compute_total" \
--run-id agent-A --op rename --json \
| python3 -c "import sys,json; print(json.load(sys.stdin)['reservation_id'])")
# Step 2 — agent-B modifies the renamed symbol (waits for step 1)
MODIFY_ID=$(muse coord reserve "src/billing.py::compute_invoice_total" \
--run-id agent-B --op modify --depends-on $RENAME_ID --json \
| python3 -c "import sys,json; print(json.load(sys.stdin)['reservation_id'])")
# Step 3 — agent-C updates tests (waits for step 2)
muse coord reserve "tests/test_billing.py::test_total" \
--run-id agent-C --op modify --depends-on $MODIFY_ID --json
Check whether agent-B is blocked before it starts work:
muse coord dag --json
{
"nodes": [
{ "reservation_id": "a1b2c3...", "run_id": "agent-A",
"addresses": ["src/billing.py::compute_total"],
"operation": "rename", "status": "active", "depends_on": [] },
{ "reservation_id": "d4e5f6...", "run_id": "agent-B",
"addresses": ["src/billing.py::compute_invoice_total"],
"operation": "modify", "status": "blocked", "depends_on": ["a1b2c3..."] },
{ "reservation_id": "g7h8i9...", "run_id": "agent-C",
"addresses": ["tests/test_billing.py::test_total"],
"operation": "modify", "status": "blocked", "depends_on": ["d4e5f6..."] }
],
"edges": [
{ "from": "a1b2c3...", "to": "d4e5f6...", "reason": "depends_on" },
{ "from": "d4e5f6...", "to": "g7h8i9...", "reason": "depends_on" }
],
"topological_order": ["a1b2c3...", "d4e5f6...", "g7h8i9..."],
"cycles": [],
"active_count": 1,
"blocked_count": 2
}
agent-A rename compute_total [ACTIVE]
└─[depends_on]──▶ agent-B modify compute_invoice_total [BLOCKED]
└─[depends_on]──▶ agent-C modify test_total [BLOCKED]
# inspect the full DAG
muse coord dag --json
# show only active (unblocked) nodes
muse coord dag --active-only --json
DependencyRecord fields
| Field | Description |
|---|---|
reservation_id | The dependent — this reservation waits |
depends_on | List of reservation IDs it waits for (max 256) |
created_at | UTC timestamp |
is_blocked(reservation_id) returns true if any dependency
is still active. get_blocking(reservation_id) returns the list of active
dependencies. DAG output includes topological order (dependencies before dependents —
safe execution sequence) and detects cycles if one somehow exists.
Sharding
muse coord shard partitions the import graph into parallel-safe work zones.
Agents assigned to different shards never touch the same files, eliminating merge
conflicts by construction.
Algorithm
- Build the import graph from the HEAD snapshot: file → file edges from import pseudo-symbols.
- Compute weakly-connected components via DFS.
- Apply LPT (Longest Processing Time) heuristic: greedily assign the largest component to the shard with the fewest symbols so far, minimizing makespan.
- Count
cross_shard_edges: imports that cross shard boundaries (lower = better isolation).
muse coord shard --agents 4 --json muse coord shard --agents 8 --language Python --json muse coord shard --agents 4 --commit HEAD~10 --json
ShardPlan output
{
"commit": "f3c14a89", // 8-char display ID
"full_commit_id": "sha256:f3c14a89cd37…",
"agents": 4,
"shards_created": 4, // may be < agents if repo has fewer components
"total_files": 87,
"total_symbols": 1204,
"cross_shard_edges": 3, // lower = better isolation
"shards": [
{
"shard": 1,
"files": ["src/billing.py", "src/models.py"],
"symbol_count": 312,
"coupling_score": 1 // edges from this shard to others
}
]
}
--agents,
shards_created < agents. The extra agents have nothing to claim and
should idle or handle a different queue.
Conflict forecasting
muse coord forecast predicts merge conflicts before any agent writes code.
It runs three independent passes over active reservations and intents, each with a
calibrated confidence score.
| Pass | Conflict type | Confidence | Condition |
|---|---|---|---|
| 1 | "address_overlap" |
1.0 | Same symbol address reserved by two or more different run_ids |
| 2 | "blast_radius_overlap" |
0.75 | One reserved address is a transitive caller of another (requires muse code index); skipped if call graph unavailable |
| 3 | "operation_conflict" |
0.9 | Intents on the same address combine incompatible operations: delete vs modify, rename, or extract |
muse coord forecast --json muse coord forecast --branch feat/auth --json muse coord forecast --min-confidence 0.9 --json # high-risk only
Output
{
"active_reservations": 12,
"call_graph_available": true,
"partial_forecast": false,
"conflicts": [
{
"conflict_type": "address_overlap",
"addresses": ["src/billing.py::compute_total"],
"agents": ["agent-1@feat/auth", "agent-2@feat/payments"],
"confidence": 1.0,
"description": "Symbol reserved by 2 agents on different branches"
}
],
"high_risk": 1, // confidence >= 0.9
"medium_risk": 0, // 0.5 <= confidence < 0.9
"low_risk": 0 // confidence < 0.5
}
Agents are shown as "run_id@branch". Pass 2 is silently skipped when the
code intelligence index is unavailable, setting partial_forecast: true.
Watch — real-time event stream
muse coord watch streams coordination events as NDJSON (one JSON object
per line). On macOS/BSD it uses kqueue (zero CPU between events); on Linux it falls
back to polling at a configurable interval (default 1.0s).
muse coord watch --json muse coord watch --once --json # exit after first event muse coord watch --kind reservation --json # filter by kind muse coord watch --run-id agent-1 --json # filter by agent
Event types and fields
| Field | Values | Description |
|---|---|---|
event_type | "snapshot" "added" "modified" "removed" "expired" | What happened to the record |
kind | "reservation" "intent" "release" "heartbeat" | Type of coordination record |
id | str | ID of the changed record |
timestamp | datetime (UTC) | When the event occurred |
data | dict | Parsed record content; empty dict for removed records with no cache |
"snapshot" events fire on startup for each existing record — useful for
replaying current state before watching for changes.
Remote sync
Coordination state can be pushed to and pulled from MuseHub, enabling multi-machine
agent pipelines. Remote records are stored read-only under
.muse/coordination/remote/.
# push local coordination records to MuseHub (batches of 500) muse coord sync push \ --hub http://staging.musehub.ai \ --owner gabriel --slug muse --json # pull remote records (paginated, use --since-id for incremental) muse coord sync pull \ --hub http://staging.musehub.ai \ --owner gabriel --slug muse \ --since-id 1024 --limit 500 --json
Push is idempotent: the hub responds with {"inserted": N, "skipped": M}.
Pull returns a cursor to use as --since-id in the next call.
Supported kinds: reservation, intent, release,
heartbeat, task, claim, dependency.
Storage layout
All coordination state is file-backed under .muse/coordination/. Every
file is JSON. IDs are content-addressed sha256 hex strings used directly as filenames.
.muse/coordination/ reservations/<sha256-hex>.json ← write-once advisory symbol lease intents/<sha256-hex>.json ← write-once operation declaration (no TTL) releases/<sha256-hex>.json ← write-once tombstone (completed/cancelled) heartbeats/<sha256-hex>.json ← mutable keep-alive; atomically overwritten tasks/<sha256-hex>.json ← immutable task definition claims/<sha256-hex>.json ← mutable claim state (O_CREAT|O_EXCL then atomic) dependencies/<sha256-hex>.json ← write-once DAG edges remote/ ← records synced from MuseHub (read-only)
GC policy
muse coord gc collects stale records after a configurable grace period
(default 300s = 5 min, protecting against agents mid-read). Run dry-run first:
muse coord gc --json # dry-run, shows what would be collected muse coord gc --execute --json # actually delete
| Collected | Condition |
|---|---|
| Expired reservations | TTL exhausted (including heartbeat extensions), past grace period |
| Released reservations | Release tombstone present, older than grace period |
| Orphaned releases | Tombstone exists but reservation file gone |
| Orphaned heartbeats | Heartbeat exists but reservation file gone |
| Old intents (opt-in) | --include-intents; older than --max-intent-age (default 7 days) |
CLI reference
All commands accept --json.
| Task | Command |
|---|---|
| Enqueue a task | muse coord enqueue "title" --run-id orch --json |
| Claim next pending task | muse coord claim --run-id agent-1 --json |
| Claim from specific queue | muse coord claim --queue refactor --run-id agent-1 --json |
| Complete a task | muse coord complete <task_id> --run-id agent-1 --json |
| Fail a task | muse coord fail-task <task_id> --run-id agent-1 --error "msg" --json |
| Cancel a task | muse coord cancel-task <task_id> --run-id orch --json |
| List tasks | muse coord tasks --status pending --json |
| Reserve symbol addresses | muse coord reserve "billing.py::Fn" --run-id agent-1 --op modify --json |
| Reserve with TTL | muse coord reserve "billing.py::Fn" --run-id agent-1 --ttl 7200 --json |
| Reserve with dependency | muse coord reserve "billing.py::Fn" --run-id agent-B --depends-on <id> --json |
| Release a reservation | muse coord release <id> --run-id agent-1 --json |
| Release all for a run | muse coord release --all-for-run agent-1 --run-id agent-1 --json |
| Heartbeat (reservation) | muse coord heartbeat <reservation_id> --run-id agent-1 --json |
| Heartbeat (claim) | muse coord heartbeat <task_id> --run-id agent-1 --json |
| List active reservations | muse coord list --json |
| Inspect DAG | muse coord dag --json |
| Shard codebase | muse coord shard --agents 4 --json |
| Forecast conflicts | muse coord forecast --json |
| Reconcile merge order | muse coord reconcile --json |
| Watch events | muse coord watch --json |
| Push to MuseHub | muse coord sync push --hub URL --owner O --slug S --json |
| Pull from MuseHub | muse coord sync pull --hub URL --owner O --slug S --json |
| GC stale records (dry-run) | muse coord gc --json |
| GC (execute) | muse coord gc --execute --json |
Default TTLs
| Entity | Default TTL |
|---|---|
| Reservation | 3600s (1h) |
| Task claim | 3600s (1h) |
| Task pending | 86400s (24h) |
| Heartbeat extension | 3600s |
| Intent | No TTL — permanent until GC with --include-intents |
| GC grace period | 300s (5 min) |