gabriel / muse public
Open #5
filed by gabriel human · 42 days ago

muse code port: semantic cross-language porting engine


Vision

Every time a codebase is rewritten in a new language, the same tragedy plays out: engineers spend weeks manually reading Python (or whatever the source is), trying to understand what a function actually does at a semantic level, then transcribing it into Rust — re-discovering edge cases that the original author knew but never documented, missing invariants embedded in call ordering, and breaking contracts that existed only as tribal knowledge.

Muse already understands code at a level that no VCS has ever attempted. It has:

  • A multi-language AST parser (muse/plugins/code/ast_parser.py) that extracts SymbolRecord trees — content_id, body_hash, signature_id — for Python, Rust, TypeScript, Go, Java, C, C++, Swift, Kotlin, Ruby, and more
  • A call graph (muse/plugins/code/_callgraph.py) — both forward (ForwardGraph: what does this function call?) and reverse (ReverseGraph: what calls this?)
  • Transitive blast-radius analysis (muse/cli/commands/impact.py) — the full dependency closure of any symbol
  • Type-health analysis (muse/core/type_analysis.py) — annotation coverage, Any propagation chains, signature drift over history
  • Symbol diff (muse/plugins/code/symbol_diff.py) — rename vs. move vs. signature change vs. implementation change, all structurally classified
  • Refactor detection (muse/cli/commands/detect_refactor.py) — semantic operation history across commits
  • A persistent symbol cache (muse/core/symbol_cache.py) — sha256 → SymbolTree in msgpack, 60× faster than re-parsing
  • Dead code detection (muse/cli/commands/dead.py), coupling analysis (coupling.py), gravity/centrality (gravity.py), hotspots (hotspots.py)
  • A query engine (muse/core/query_engine.py) — walk history, evaluate predicates, extract matches across commits

muse code port is the natural next command that emerges from this stack. It answers: "Given that I want to rewrite this codebase in another language, what exactly needs to happen, in what order, and where are the hard parts?"


The Command

muse code port --from Python --to Rust [path/to/file.py] [flags]

What it produces

Phase 1 — Inventory

A complete, ordered porting manifest. Not a flat list of files — a topologically sorted work queue based on the dependency graph:

muse code port --from Python --to Rust

Porting inventory: Python → Rust
Snapshot: a3f2c9e1  497 files  14 642 symbols

Phase 1 — Leaf modules (no internal deps, port first)
  muse/core/errors.py          12 symbols   0 internal deps   ██░░░░ typed 42%
  muse/core/validation.py      8 symbols    0 internal deps   █████░ typed 83%
  muse/core/_types.py          31 symbols   0 internal deps   ██████ typed 100%

Phase 2 — Core layer (depends only on Phase 1)
  muse/core/object_store.py    24 symbols   3 deps            ████░░ typed 67%
  muse/core/schema.py          18 symbols   2 deps            ██░░░░ typed 31%
  muse/plugins/code/ast_parser.py  89 symbols  5 deps         ░░░░░░ typed 12%

...

Phase 8 — CLI layer (depends on everything)
  muse/cli/commands/impact.py  11 symbols   42 deps           ███░░░ typed 55%

────────────────────────────────────────────────────────────
Total: 8 phases  497 files  14 642 symbols
Estimated porting complexity: 847 symbol-days (see --complexity)

Phase 2 — Symbol-by-symbol translation guidance

For each symbol, emit a structured translation card:

muse code port --from Python --to Rust muse/core/object_store.py::read_object

Symbol:    muse/core/object_store.py::read_object
Kind:      async_function → Rust async fn
Type:      (object_id: str, repo_path: Path) -> bytes | None
Typed:     Yes (100%)
Callers:   47  (blast radius depth 1)
Calls:     _read_local, _read_s3, decompress_lz4, validate_object_id (4 direct)

Rust translation hints:
  str → &str or String (choose &str for borrowed, String if stored)
  Path → std::path::Path / PathBuf
  bytes | None → Option<Vec<u8>>
  async → async fn, awaited inside an async runtime (e.g. #[tokio::main] on main)
  LZ4 decompression → lz4_flex crate

Invariants detected (from call history + type analysis):
  - object_id always matches ^sha256:[0-9a-f]{64}$ (validated at 3 call sites)
  - Never returns empty bytes: callers treat None as absent, and b"" is invalid
  - read_object is called 47× but _read_s3 is only reached when R2_BUCKET is set

Risks:
  - HIGH: async error handling style differs. Python returns None on S3 botocore
    exceptions; Rust equivalent should use Result<Option<Vec<u8>>, StorageError>
  - MED: Path.resolve() has different symlink semantics on Linux vs macOS
  - LOW: LZ4 block format vs frame format — verify crate compatibility

Phase 3 — Port progress tracking

Once you start creating .rs files alongside the originals, muse code port tracks progress:

muse code port --status

Python → Rust port progress
────────────────────────────────────
Phase 1  ████████████████████ 100%   12/12 symbols
Phase 2  ████████░░░░░░░░░░░░  42%   18/43 symbols
Phase 3  ░░░░░░░░░░░░░░░░░░░░   0%    0/89 symbols
...
────────────────────────────────────
Total:  30/14 642 symbols  (0.2%)

Next recommended: muse/core/schema.py (18 symbols, all Phase 2 deps satisfied)

Phase 4 — Semantic equivalence verification

After porting a symbol, verify semantic equivalence across the language boundary:

muse code port --verify muse/core/object_store.py::read_object \
               --against src/object_store.rs::read_object

Semantic diff: Python read_object ↔ Rust read_object

  Signature match:   ✅ (str/&str, Path/PathBuf, Option<Vec<u8>>)
  Invariant coverage: ⚠️  Python validates object_id format; Rust version does not
  Error paths:       ❌  Python swallows S3 errors (returns None); Rust propagates Err
  Async model:       ✅  Both async
  Complexity:        Python cyclomatic 7, Rust cyclomatic 5 (simplified)

Architecture

New command entrypoint

muse/cli/commands/port.py

Registered under muse code port alongside the existing impact, deps, dead, etc. The registration point is the CodePlugin in muse/plugins/code/plugin.py.

Core engine

muse/plugins/code/_port_engine.py

from typing import TypedDict

# Defined leaf-first so each annotation refers to a name that already exists.
class PortFileEntry(TypedDict):
    path: str
    symbols: int
    internal_deps: list[str]   # other files in the repo it imports
    type_coverage_pct: float
    complexity_score: float    # cyclomatic complexity aggregate

class PortPhase(TypedDict):
    phase: int
    files: list[PortFileEntry]
    all_deps_satisfied_by: list[int]  # earlier phase numbers

class PortPlan(TypedDict):
    phases: list[PortPhase]
    total_files: int
    total_symbols: int
    source_lang: str
    target_lang: str

def build_port_plan(
    manifest: Manifest,
    symbol_cache: SymbolCache,
    source_lang: str,
    target_lang: str,
) -> PortPlan: ...

def symbol_translation_card(
    address: str,
    symbol: SymbolRecord,
    reverse_graph: ReverseGraph,
    forward_graph: ForwardGraph,
    type_map: AnnotationMap,
    target_lang: str,
) -> SymbolTranslationCard: ...

Topological sort (why it matters)

The key primitive is already partially implemented via muse/plugins/code/_callgraph.py (ForwardGraph) and muse/cli/commands/deps.py (import graph). build_port_plan extends this:

  1. Build the file-level import DAG (already available via deps)
  2. Topological-sort the DAG into phases — leaf nodes (no internal imports) first
  3. Within each phase, sort by symbol count ascending (cheapest first for quick wins)
  4. Surface type coverage and cyclomatic complexity as per-file porting cost signals

This is the right order to port a codebase: it ensures that when you implement src/object_store.rs, all its dependencies are already ported. Git has no concept of this ordering and cannot even tell you the import graph.
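The phase layering described in the steps above can be sketched with the standard library's graphlib. This is a minimal illustration, not the proposed build_port_plan: the plan_phases helper, the import graph, and the symbol counts below are hypothetical stand-ins for what deps and the symbol cache would supply.

```python
from graphlib import TopologicalSorter


def plan_phases(
    imports: dict[str, set[str]],
    symbol_counts: dict[str, int],
) -> list[list[str]]:
    """Group files into port phases; every file lands after all its imports."""
    # TopologicalSorter expects node -> predecessors, i.e. file -> files it imports.
    sorter = TopologicalSorter(imports)
    sorter.prepare()  # raises graphlib.CycleError on circular imports

    phases: list[list[str]] = []
    while sorter.is_active():
        ready = sorter.get_ready()  # files whose imports sit in earlier phases
        # Cheapest first within a phase: fewest symbols, then path for stability.
        phases.append(sorted(ready, key=lambda f: (symbol_counts.get(f, 0), f)))
        sorter.done(*ready)
    return phases


# Hypothetical import graph, loosely modeled on the inventory example.
imports = {
    "muse/core/errors.py": set(),
    "muse/core/validation.py": set(),
    "muse/core/schema.py": {"muse/core/errors.py"},
    "muse/core/object_store.py": {"muse/core/errors.py", "muse/core/validation.py"},
    "muse/cli/commands/impact.py": {"muse/core/schema.py", "muse/core/object_store.py"},
}
symbol_counts = {
    "muse/core/errors.py": 12,
    "muse/core/validation.py": 8,
    "muse/core/schema.py": 18,
    "muse/core/object_store.py": 24,
    "muse/cli/commands/impact.py": 11,
}

for number, files in enumerate(plan_phases(imports, symbol_counts), start=1):
    print(f"Phase {number}: {files}")
```

Real Python codebases sometimes contain circular imports, which this sketch rejects with CycleError. A full implementation would presumably need to collapse each cycle into a single unit that is ported together.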

Language type mapping

muse/plugins/code/_type_maps.py

PYTHON_TO_RUST: dict[str, str] = {
    "str":        "&str / String",
    "int":        "i64 / usize",
    "float":      "f64",
    "bool":       "bool",
    "bytes":      "Vec<u8> / &[u8]",
    "None":       "()",
    "X | None":   "Option<X>",
    "list[X]":    "Vec<X>",
    "dict[K,V]":  "HashMap<K, V>",
    "set[X]":     "HashSet<X>",
    "tuple[X,Y]": "(X, Y)",
    "Path":       "std::path::PathBuf",
    "Exception":  "Box<dyn std::error::Error>",
    "Callable":   "fn() / Box<dyn Fn()>",
    "Iterator":   "impl Iterator<Item=X>",
    "AsyncIterator": "impl Stream<Item=X>",  # futures::Stream
}

PYTHON_TO_TYPESCRIPT: dict[str, str] = { ... }
PYTHON_TO_GO: dict[str, str] = { ... }
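A flat dictionary keyed on strings like "X | None" cannot handle nested annotations directly, so the lookup probably needs a small recursive pass over the parsed annotation. The sketch below assumes annotations are available as source strings and picks the owned Rust form in each case; to_rust, SIMPLE, and GENERIC are illustrative names, not part of the proposed _type_maps.py.

```python
import ast

# Owned-type defaults; a real card would also offer borrowed forms (&str, &[u8]).
SIMPLE = {"str": "String", "int": "i64", "float": "f64", "bool": "bool",
          "bytes": "Vec<u8>", "Path": "std::path::PathBuf"}
GENERIC = {"list": "Vec", "set": "HashSet", "dict": "HashMap"}


def to_rust(annotation: str) -> str:
    """Translate a Python annotation string into a candidate Rust type."""
    return _convert(ast.parse(annotation, mode="eval").body)


def _is_none(node: ast.expr) -> bool:
    return isinstance(node, ast.Constant) and node.value is None


def _convert(node: ast.expr) -> str:
    if _is_none(node):
        return "()"
    if isinstance(node, ast.Name):
        # Unknown names pass through unchanged so a human can review them.
        return SIMPLE.get(node.id, node.id)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitOr):
        # "X | None" (in either order) becomes Option<X>.
        if _is_none(node.right):
            return f"Option<{_convert(node.left)}>"
        if _is_none(node.left):
            return f"Option<{_convert(node.right)}>"
        raise ValueError("unions other than 'X | None' need a hand-written enum")
    if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
        args = node.slice.elts if isinstance(node.slice, ast.Tuple) else [node.slice]
        inner = ", ".join(_convert(arg) for arg in args)
        if node.value.id == "tuple":
            # A one-element Rust tuple needs a trailing comma.
            return f"({inner},)" if len(args) == 1 else f"({inner})"
        if node.value.id in GENERIC:
            return f"{GENERIC[node.value.id]}<{inner}>"
    raise ValueError(f"no mapping for {ast.unparse(node)}")


print(to_rust("bytes | None"))          # Option<Vec<u8>>
print(to_rust("dict[str, list[int]]"))  # HashMap<String, Vec<i64>>
print(to_rust("tuple[str, Path]"))      # (String, std::path::PathBuf)
```

Raising on anything unmapped, rather than guessing, keeps the translation card honest: an unmapped annotation becomes a visible review item instead of a silently wrong hint.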

Invariant extraction

muse/plugins/code/_invariant_extractor.py

Mining invariants from the call graph and type history — things that are enforced at runtime but not in the type signature — is the hardest and most valuable part:

  1. Regex-pattern invariants: scan callers of a function for re.match(pattern, arg) guards before the call — surface as "arg always matches pattern"
  2. Nullability invariants: trace None-check patterns; if callers always guard before using the result, document as "may return None — callers must check"
  3. Ordering invariants: detect when function A is always called before function B (via call-graph adjacency in ForwardGraph) — surface as "call ordering dependency"
  4. Historical invariants: use walk_history from muse/core/query_engine.py to scan commit messages for assert, invariant, precondition, contract keywords near the symbol
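As an illustration of the first item, regex-guard mining can be approximated with Python's ast module. The sketch below assumes caller source is available as text and only recognizes literal patterns passed to re.match or re.fullmatch; regex_guards is a hypothetical helper. It also ignores control flow, so it reports guards that appear anywhere in the calling function, not strictly before the call.

```python
import ast


def regex_guards(source: str, target: str) -> dict[int, set[str]]:
    """Map argument position -> regex patterns checked on that argument
    somewhere in a function that also calls `target`."""
    found: dict[int, set[str]] = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        guards: dict[str, set[str]] = {}  # variable name -> literal patterns
        calls: list[ast.Call] = []        # calls to `target` in this function
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            if (isinstance(callee, ast.Attribute)
                    and isinstance(callee.value, ast.Name)
                    and callee.value.id == "re"
                    and callee.attr in {"match", "fullmatch"}
                    and len(node.args) >= 2
                    and isinstance(node.args[0], ast.Constant)
                    and isinstance(node.args[0].value, str)
                    and isinstance(node.args[1], ast.Name)):
                guards.setdefault(node.args[1].id, set()).add(node.args[0].value)
            elif isinstance(callee, ast.Name) and callee.id == target:
                calls.append(node)
        for call in calls:
            for position, arg in enumerate(call.args):
                if isinstance(arg, ast.Name) and arg.id in guards:
                    found.setdefault(position, set()).update(guards[arg.id])
    return found


# Hypothetical caller of read_object, used only to exercise the helper.
caller = '''
import re

def load(object_id, repo_path):
    if not re.match(r"^sha256:[0-9a-f]{64}$", object_id):
        raise ValueError(object_id)
    return read_object(object_id, repo_path)
'''
print(regex_guards(caller, "read_object"))
# {0: {'^sha256:[0-9a-f]{64}$'}}
```

Promoting a pattern to "always matches" would additionally require that every caller in the reverse graph shows the same guard, which this single-file sketch does not check.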

Cross-language semantic diff

muse/plugins/code/_cross_lang_diff.py

Compare a source-language symbol and its claimed equivalent in the target language:

  • Signature mapping: apply type map, flag mismatches
  • Error handling model: Python raise/None vs Rust Result/? vs Go error — detect the mismatch
  • Async model: Python asyncio vs Rust tokio/async-std vs Go goroutines
  • Ownership model: Python GC vs Rust borrow checker hints (lifetime annotations needed?)
  • Complexity comparison: cyclomatic complexity per symbol from AST walk — if Rust version is significantly higher/lower, flag for review
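The complexity comparison in the last bullet can be sketched for the Python side with an AST walk. This is an approximation of McCabe complexity under stated assumptions: the set of counted node types and the 50% tolerance are illustrative choices, and the target-language score is assumed to come from a comparable walk over that language's AST.

```python
import ast

# Node types that each add one decision point (an illustrative selection).
_BRANCHES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While,
             ast.ExceptHandler, ast.comprehension, ast.Assert)


def cyclomatic(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCHES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` holds two short-circuit decisions.
            score += len(node.values) - 1
    return score


def needs_review(source_score: int, target_score: int, tolerance: float = 0.5) -> bool:
    """Flag a pair whose scores differ by more than `tolerance` (relative)."""
    return abs(source_score - target_score) > tolerance * max(source_score, 1)


sample = '''
def read(object_id, local):
    if not object_id:
        return None
    for store in (local,):
        if store and object_id in store:
            return store[object_id]
    return None
'''
print(cyclomatic(sample))  # 5
print(needs_review(7, 5))  # False
```

With these settings the 7-versus-5 pair from the verification example would pass without a flag, while a drop from 7 to 2 would be surfaced for review.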

Port Progress as a First-Class VCS Concept

The really powerful insight: porting progress should live in Muse itself, not in a spreadsheet.

When you create src/object_store.rs as the port of muse/core/object_store.py, you annotate the relationship:

muse code port --link muse/core/object_store.py::read_object \
               --to src/object_store.rs::read_object

This writes a .muse/port-map.toml file:

[port_map]
source_lang = "Python"
target_lang = "Rust"

[[port_map.links]]
source = "muse/core/object_store.py::read_object"
target = "src/object_store.rs::read_object"
status = "ported"         # ported | in-progress | needs-review | verified
verified_at = "a3f2c9e1"  # commit at which equivalence was last verified

Now muse code port --status reads the port map and gives you a live progress dashboard. And muse code port --verify reruns semantic diff against the verified commit, flagging if the source has since drifted (new callers, changed signature, added invariants).

This is port-aware history. If read_object gets a new parameter added in commit deadbeef, muse code port --drift immediately surfaces:

PORT DRIFT DETECTED

  muse/core/object_store.py::read_object
  Last verified: a3f2c9e1
  Current:       deadbeef

  Changes since verification:
    + parameter: compress: bool = False  (added in commit deadbeef)
    Callers updated: 12 sites
    Rust equivalent: src/object_store.rs::read_object — NOT YET UPDATED

  Action: re-port or extend the Rust function to match new signature
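The parameter-level part of such a drift report can be approximated by diffing the function signature at the verified commit against the current one. Muse would presumably compare signature_id values from the symbol cache; the sketch below is only a stand-in that works on two source strings, and the parameters and signature_drift helpers are hypothetical. It ignores *args and **kwargs.

```python
import ast


def parameters(source: str, name: str) -> dict[str, str]:
    """Return {parameter name: rendered declaration} for function `name`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            args = node.args
            positional = args.posonlyargs + args.args
            # Defaults align with the last positional parameters.
            padding = [None] * (len(positional) - len(args.defaults))
            pairs = list(zip(positional, padding + args.defaults))
            pairs += list(zip(args.kwonlyargs, args.kw_defaults))
            rendered: dict[str, str] = {}
            for arg, default in pairs:
                text = arg.arg
                if arg.annotation is not None:
                    text += f": {ast.unparse(arg.annotation)}"
                if default is not None:
                    text += f" = {ast.unparse(default)}"
                rendered[arg.arg] = text
            return rendered
    raise LookupError(name)


def signature_drift(verified: str, current: str, name: str) -> list[str]:
    """List added, removed, and changed parameters between two versions."""
    old, new = parameters(verified, name), parameters(current, name)
    report = [f"+ parameter: {new[p]}" for p in new if p not in old]
    report += [f"- parameter: {old[p]}" for p in old if p not in new]
    report += [f"~ parameter: {old[p]} -> {new[p]}"
               for p in old if p in new and old[p] != new[p]]
    return report


verified = "async def read_object(object_id: str, repo_path: Path) -> bytes | None: ..."
current = ("async def read_object(object_id: str, repo_path: Path, "
           "compress: bool = False) -> bytes | None: ...")
print(signature_drift(verified, current, "read_object"))
# ['+ parameter: compress: bool = False']
```

An empty report here would not prove the port is still valid, since body changes and new invariants are invisible to a signature diff; it only covers the "changed signature" case named above.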

The Python → Rust Port of Muse Itself

This feature is also the roadmap for rewriting Muse in Rust. Running muse code port --from Python --to Rust against the muse repo would output:

  • Phase 1 (leaf modules, ~40 files): muse/core/errors.py, muse/core/validation.py, muse/core/_types.py, muse/core/schema.py, muse/core/semver.py, muse/core/bip39.py
  • Phase 2 (~60 files): muse/core/object_store.py, muse/core/snapshot.py, muse/core/dag.py, muse/core/identity.py, muse/core/msign.py, muse/core/keypair.py
  • Phase 3 (~80 files): muse/core/repo.py, muse/core/store.py, muse/core/merge_engine.py, muse/core/rebase.py, muse/core/gc.py
  • Phase 4 (~50 files): muse/plugins/code/ast_parser.py (tree-sitter — native Rust tree-sitter already exists), muse/plugins/code/_callgraph.py, muse/plugins/code/_query.py
  • Phase 5+: CLI layer — muse/cli/commands/ (~170 files, high porting complexity due to argparse → clap migration)

The first deliverable of muse code port is its own porting plan.


Key Files and Symbols

| File | Role |
| --- | --- |
| muse/plugins/code/plugin.py | Register port subcommand in CodePlugin |
| muse/plugins/code/ast_parser.py::SymbolRecord | Source of content_id, body_hash, signature_id per symbol |
| muse/plugins/code/ast_parser.py::LanguageAdapter | Protocol all language backends implement; extend for cross-lang type mapping |
| muse/plugins/code/_callgraph.py::ForwardGraph | "What does this function call?" Foundation for port-order topological sort |
| muse/plugins/code/_callgraph.py::ReverseGraph | "What calls this?" Surfaces blast radius of any port decision |
| muse/plugins/code/_callgraph.py::build_reverse_graph | Entry point for building the full call graph |
| muse/plugins/code/_query.py::symbols_for_snapshot | Extract full SymbolTree from a committed snapshot |
| muse/plugins/code/_query.py::language_of | Language classification by file extension; already covers Python + Rust + 15 others |
| muse/core/symbol_cache.py::SymbolCache | 60× faster symbol extraction via content-addressed cache |
| muse/core/type_analysis.py | Annotation coverage, Any-blast-radius, migration targets; feeds type-coverage column in port plan |
| muse/core/query_engine.py::walk_history | Walk commit DAG for invariant mining |
| muse/cli/commands/deps.py | File-level import graph; foundation for phase topology |
| muse/cli/commands/impact.py | Transitive blast radius; muse code port reuses this for "porting this symbol forces porting these N others" |
| muse/cli/commands/detect_refactor.py | Semantic operation history; can detect if a Python symbol was already partially Rust-ified |
| muse/cli/commands/dead.py | Dead code detection; symbols with no callers can be deprioritized in the port plan |
| muse/cli/commands/breakage.py | After porting, run breakage check on the new language target |
| muse/core/invariants.py | Existing invariant infrastructure; extend for cross-language invariant extraction |
| muse/cli/commands/port.py | New file: the muse code port command entrypoint |
| muse/plugins/code/_port_engine.py | New file: PortPlan, PortPhase, PortFileEntry, build_port_plan |
| muse/plugins/code/_type_maps.py | New file: PYTHON_TO_RUST, PYTHON_TO_TYPESCRIPT, PYTHON_TO_GO |
| muse/plugins/code/_invariant_extractor.py | New file: regex/nullability/ordering invariant mining |
| muse/plugins/code/_cross_lang_diff.py | New file: semantic equivalence verification across language boundary |

Success Criteria

  • muse code port --from Python --to Rust produces a topologically-sorted porting plan for the muse repo itself
  • Each symbol card includes: callers, callees, type mapping, detected invariants, risk flags
  • muse code port --link writes .muse/port-map.toml and --status reads it
  • muse code port --verify runs semantic diff between source and target symbol
  • muse code port --drift detects when source has changed since last verification
  • Port plan respects --language filter (port only muse/core/ first, not all 497 files)
  • Works for Python→Rust, Python→Go, Python→TypeScript, and any language pair where both sides have AST adapters in ast_parser.py
  • muse code port run on the muse repo produces a coherent Phase 1–8 ordering that a Rust engineer could execute sequentially with zero ambiguity

Prior Art

  • c2rust — mechanical C-to-Rust transpiler; produces unsafe Rust; no semantic understanding
  • py2rust — abandoned; no call-graph or invariant awareness
  • LLM-assisted rewrites (GPT-4, Claude) — hallucinate invariants, miss call ordering, no progress tracking
  • Muse's muse code port — the first porting tool that understands the semantic topology of a codebase and produces a dependency-ordered, invariant-annotated, progress-tracked porting plan grounded entirely in the committed symbol graph