prometeu-runtime/files/Hard Reset.md
2026-03-24 13:40:30 +00:00


# Prometeu Industrial-Grade Refactor Plan (JVM-like)
**Language policy:** All implementation notes, code comments, commit messages, PR descriptions, and review discussion **must be in English**.
**Reset policy:** This is a **hard reset**. We do **not** keep compatibility with the legacy bytecode/linker/verifier behaviors. No heuristics, no “temporary support”, no string hacks.
**North Star:** A JVM-like philosophy:
* Control-flow is **method-local** and **canonical**.
* The linker resolves **symbols** and **tables**, not intra-function branches.
* A **single canonical layout/decoder/spec** is used across compiler/linker/verifier/VM.
* Any invalid program fails with clear diagnostics, not panics.
**Estimation scale:**
* **1 point** = small / straightforward
* **3 points** = medium
* **5 points** = large
* Any work that feels bigger must be broken down into multiple PRs (≤ 5 points each).
---
## Phase 1 — Single Source of Truth: Bytecode Spec + Decoder (Highest ROI)
### PR-01 (3 pts) — Move OpcodeSpec to `prometeu-bytecode` and make it authoritative
**Briefing**
Today opcode metadata (imm sizes, stack effects, branch-ness, terminators) is duplicated and/or inconsistent across crates. This creates a perpetual maintenance nightmare.
**Target**
Create one authoritative opcode spec in **`prometeu-bytecode`** and delete/replace all “local” opcode knowledge.
**Scope**
* Create `prometeu-bytecode::opcode_spec` containing:
* `imm_bytes`
* `pops`, `pushes` (stack effect)
* `is_branch`, `is_terminator`
* optional: `name`, `category`
* Update callers to import from `prometeu-bytecode`.
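The spec in Scope can be sketched as a single table keyed by opcode. This is a minimal sketch only: the opcode values, names, and the `spec()` lookup shape below are illustrative assumptions, not the real Prometeu opcode set.

```rust
// Hypothetical sketch of the canonical spec; real opcode names and values
// come from the existing `prometeu-bytecode` opcode enum.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct OpcodeSpec {
    pub imm_bytes: u8,      // size of the immediate operand, 0 if none
    pub pops: u8,           // values popped from the operand stack
    pub pushes: u8,         // values pushed onto the operand stack
    pub is_branch: bool,
    pub is_terminator: bool,
    pub name: &'static str,
}

// Illustrative opcodes only, not the real Prometeu instruction set.
pub fn spec(opcode: u8) -> Option<OpcodeSpec> {
    Some(match opcode {
        0x00 => OpcodeSpec { imm_bytes: 0, pops: 0, pushes: 0, is_branch: false, is_terminator: false, name: "Nop" },
        0x01 => OpcodeSpec { imm_bytes: 8, pops: 0, pushes: 1, is_branch: false, is_terminator: false, name: "PushI64" },
        0x02 => OpcodeSpec { imm_bytes: 4, pops: 1, pushes: 0, is_branch: true,  is_terminator: false, name: "JmpIf" },
        0x03 => OpcodeSpec { imm_bytes: 0, pops: 0, pushes: 0, is_branch: false, is_terminator: true,  name: "Ret" },
        _ => return None,
    })
}

fn main() {
    // Mirrors the completion test: every opcode has a spec with a defined imm size.
    for op in [0x00u8, 0x01, 0x02, 0x03] {
        let s = spec(op).expect("opcode must have a spec");
        println!("{}: imm_bytes={}", s.name, s.imm_bytes);
    }
}
```

Returning `Option` (rather than panicking) keeps unknown opcodes a recoverable, testable condition, which the decoder in PR-02 can turn into a typed error.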
**Requirements Checklist**
* [ ] There is exactly one canonical `OpcodeSpec` source.
* [ ] All crates compile against that source.
* [ ] No hardcoded operand sizes remain outside the spec.
**Completion Tests**
* [ ] Unit test enumerating all opcodes validates:
* every opcode has a spec
* `imm_bytes` is defined
---
### PR-02 (5 pts) — Introduce canonical decoder in `prometeu-bytecode` and migrate VM to it
**Briefing**
The VM currently has its own decoder. The linker and other tools decode manually. This must be centralized.
**Target**
Add a single canonical decoder in `prometeu-bytecode` that produces typed decoded instructions.
**Scope**
* Add `prometeu-bytecode::decoder`:
* `decode_next(pc, bytes) -> DecodedInstr`
* includes: opcode, pc, next_pc, raw immediate bytes slice
* helpers: `imm_u8/u16/u32/i32/i64/f64` with size validation
* Migrate VM to use `prometeu-bytecode::decoder`.
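The decoder API above can be sketched as follows. The error variants, the inline immediate-size table, and the exact `DecodedInstr` fields are assumptions for illustration; in the real crate the size lookup would be `OpcodeSpec::imm_bytes`.

```rust
// Hedged sketch of the canonical decoder from this PR's Scope.
#[derive(Debug, PartialEq)]
pub struct DecodedInstr<'a> {
    pub opcode: u8,
    pub pc: usize,
    pub next_pc: usize,
    pub imm: &'a [u8], // raw immediate bytes; length equals the spec's imm_bytes
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnknownOpcode { pc: usize, opcode: u8 },
    TruncatedImmediate { pc: usize, needed: usize, available: usize },
    PcOutOfRange { pc: usize },
}

// Stand-in for OpcodeSpec::imm_bytes (illustrative opcodes only).
fn imm_bytes(opcode: u8) -> Option<usize> {
    match opcode {
        0x00 | 0x03 => Some(0), // Nop, Ret
        0x02 => Some(4),        // JmpIf
        0x01 => Some(8),        // PushI64
        _ => None,
    }
}

pub fn decode_next(pc: usize, bytes: &[u8]) -> Result<DecodedInstr<'_>, DecodeError> {
    let &opcode = bytes.get(pc).ok_or(DecodeError::PcOutOfRange { pc })?;
    let n = imm_bytes(opcode).ok_or(DecodeError::UnknownOpcode { pc, opcode })?;
    let imm_start = pc + 1;
    let next_pc = imm_start + n;
    if next_pc > bytes.len() {
        return Err(DecodeError::TruncatedImmediate { pc, needed: n, available: bytes.len() - imm_start });
    }
    Ok(DecodedInstr { opcode, pc, next_pc, imm: &bytes[imm_start..next_pc] })
}

impl DecodedInstr<'_> {
    // Size-validated accessor in the spirit of the plan's `imm_u32` helper:
    // returns None instead of slicing blindly when the size is wrong.
    pub fn imm_u32(&self) -> Option<u32> {
        self.imm.try_into().ok().map(u32::from_le_bytes)
    }
}

fn main() {
    let code = [0x02u8, 0x0A, 0x00, 0x00, 0x00, 0x03]; // JmpIf 10; Ret
    let i = decode_next(0, &code).unwrap();
    println!("opcode={:#04x} next_pc={} imm={:?}", i.opcode, i.next_pc, i.imm_u32());
}
```

Every failure mode is a deterministic `DecodeError`, which is what "fails deterministically" in the checklist requires; the VM, linker, and verifier all consume the same function.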
**Requirements Checklist**
* [ ] VM no longer has a bespoke decoder.
* [ ] No slicing-based immediate parsing in VM core paths.
* [ ] Decoder validates immediate sizes and fails deterministically.
**Completion Tests**
* [ ] Decoder unit tests for representative opcodes with each immediate size.
* [ ] Roundtrip test: encode→decode (table-driven; property test optional).
---
### PR-03 (3 pts) — Delete/neutralize `abi::operand_size` duplication
**Briefing**
`prometeu-bytecode/src/abi.rs` provides partial operand sizing that can drift from the canonical spec.
**Target**
Make all operand sizing derived from the opcode spec.
**Scope**
* Replace `operand_size()` with `OpcodeSpec::imm_bytes`.
* Remove or restrict legacy APIs that leak duplication.
**Requirements Checklist**
* [ ] There is no second operand-size table.
**Completion Tests**
* [ ] Test ensuring `operand_size()` (if retained) matches spec for all opcodes.
---
## Phase 2 — Canonical Layout + Verifier Contract (JVM-like Control Flow)
### PR-04 (5 pts) — Rewrite layout to compute instruction boundaries via decoder (no heuristics)
**Briefing**
Layout must be computed canonically using the decoder, not guessed via ad-hoc stepping.
**Target**
`prometeu_bytecode::layout` becomes the only authority for:
* function ranges `[start, end)`
* function length
* valid instruction boundaries
* pc→function lookup
**Scope**
* Implement layout computation by scanning bytes with the canonical decoder.
* Provide APIs:
* `function_range(func_idx) -> (start, end)`
* `function_len(func_idx)`
* `is_boundary(func_idx, rel_pc)` or `is_boundary_abs(abs_pc)`
* `lookup_function_by_pc(abs_pc)`
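The scan in Scope can be sketched as below. The instruction-length function stands in for the canonical decoder, and the function ranges are taken as illustrative inputs; field names beyond the APIs listed above are assumptions.

```rust
// Sketch of decoder-driven layout computation (no ad-hoc stepping).
pub struct Layout {
    ranges: Vec<(usize, usize)>, // per-function [start, end)
    boundaries: Vec<Vec<usize>>, // per-function sorted absolute pcs
}

// Stand-in for the canonical decoder: total instruction length at `pc`.
fn instr_len(bytes: &[u8], pc: usize) -> Option<usize> {
    let imm = match *bytes.get(pc)? {
        0x00 | 0x03 => 0, // Nop, Ret (illustrative opcodes)
        0x02 => 4,        // JmpIf
        0x01 => 8,        // PushI64
        _ => return None,
    };
    Some(1 + imm)
}

impl Layout {
    pub fn compute(bytes: &[u8], ranges: Vec<(usize, usize)>) -> Option<Layout> {
        let mut boundaries = Vec::new();
        for &(start, end) in &ranges {
            let mut pcs = Vec::new();
            let mut pc = start;
            while pc < end {
                pcs.push(pc);
                pc += instr_len(bytes, pc)?; // the decoder is the only authority
            }
            if pc != end {
                return None; // last instruction overruns the function range
            }
            boundaries.push(pcs);
        }
        Some(Layout { ranges, boundaries })
    }

    pub fn function_range(&self, func_idx: usize) -> (usize, usize) {
        self.ranges[func_idx]
    }

    pub fn function_len(&self, func_idx: usize) -> usize {
        let (s, e) = self.ranges[func_idx];
        e - s
    }

    pub fn is_boundary(&self, func_idx: usize, rel_pc: usize) -> bool {
        let abs = self.ranges[func_idx].0 + rel_pc;
        self.boundaries[func_idx].binary_search(&abs).is_ok()
    }

    pub fn lookup_function_by_pc(&self, abs_pc: usize) -> Option<usize> {
        self.ranges.iter().position(|&(s, e)| s <= abs_pc && abs_pc < e)
    }
}

fn main() {
    // One function: PushI64 imm(8); Ret => boundaries at 0 and 9, end = 10.
    let code = [0x01u8, 0, 0, 0, 0, 0, 0, 0, 0, 0x03];
    let layout = Layout::compute(&code, vec![(0, 10)]).unwrap();
    println!("len={} boundary@9={}", layout.function_len(0), layout.is_boundary(0, 9));
}
```

Note that `compute` rejects a range whose last instruction overruns `end`, so a tolerant `clamp_jump_target`-style API has nothing left to clamp.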
**Requirements Checklist**
* [ ] No “clamp_jump_target” or tolerant APIs remain.
* [ ] Layout derived only via decoder.
**Completion Tests**
* [ ] Unit tests: boundaries for a known bytecode sequence.
* [ ] Fuzz/table tests: random instruction sequences produce monotonic ranges and valid boundaries.
---
### PR-05 (3 pts) — Verifier hard reset: branches are function-relative only
**Briefing**
The verifier must not guess absolute vs relative. One encoding only.
**Target**
Branches use `immediate = target_rel_to_function_start`, with `target == func_len` allowed.
**Scope**
* Replace any dual-format logic.
* Validation:
* `target_rel <= func_len`
* if `target_rel == func_len`: OK (end-exclusive)
* else target must be an instruction boundary
* All boundary checks must come from `layout`.
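The validation rules above reduce to a small pure function. This is a sketch: in practice `func_len` and the boundary predicate would come from `prometeu_bytecode::layout`, and the error names are assumptions.

```rust
// Hedged sketch of the branch-target rule: function-relative only,
// end-exclusive target allowed, boundary check delegated to layout.
#[derive(Debug, PartialEq)]
pub enum BranchError {
    JumpOutsideFunction { target_rel: usize, func_len: usize },
    JumpToMidInstruction { target_rel: usize },
}

pub fn check_branch_target(
    target_rel: usize,
    func_len: usize,
    is_boundary: impl Fn(usize) -> bool, // layout.is_boundary(func_idx, rel_pc)
) -> Result<(), BranchError> {
    if target_rel > func_len {
        return Err(BranchError::JumpOutsideFunction { target_rel, func_len });
    }
    if target_rel == func_len {
        return Ok(()); // end-exclusive target is explicitly allowed
    }
    if !is_boundary(target_rel) {
        return Err(BranchError::JumpToMidInstruction { target_rel });
    }
    Ok(())
}

fn main() {
    // Function of length 10 with boundaries at 0, 5, 9 (illustrative).
    let boundary = |pc: usize| matches!(pc, 0 | 5 | 9);
    println!("JumpToEnd: {:?}", check_branch_target(10, 10, boundary));
}
```

The three completion tests (JumpToEnd, JumpToMidInstruction, JumpOutsideFunction) map one-to-one onto the three outcomes of this function.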
**Requirements Checklist**
* [ ] No heuristics.
* [ ] Verifier depends only on layout + decoder.
**Completion Tests**
* [ ] JumpToEnd accepted.
* [ ] JumpToMidInstruction rejected.
* [ ] JumpOutsideFunction rejected.
---
### PR-06 (3 pts) — Linker hard reset: never relocate intra-function branches
**Briefing**
Linker must not rewrite local control-flow.
**Target**
Remove any relocation/patching for `Jmp`/`JmpIf*`.
**Scope**
* Delete branch relocation logic.
* Ensure only symbol/table/call relocations remain.
**Requirements Checklist**
* [ ] Linker does not inspect/patch branch immediates.
**Completion Tests**
* [ ] Link-order invariance test (A+B vs B+A) passes for intra-function branches.
---
## Phase 3 — JVM-like Symbol Identity: Signature-based Overload & Constant-Pool Mindset
### PR-07 (5 pts) — Introduce Signature interning (`SigId`) and descriptor canonicalization
**Briefing**
Overload must be by signature, not by `name/arity`.
**Target**
Create a canonical function descriptor system (JVM-like) and intern signatures.
**Scope**
* Add `Signature` model:
* params types + return type
* Add `SignatureInterner` -> `SigId`
* Add `descriptor()` canonical representation (stable, deterministic).
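The three Scope items can be sketched together. The `Type` variants and the JVM-like `"(params)ret"` descriptor syntax are assumptions for illustration; only the `Signature`/`SignatureInterner`/`SigId`/`descriptor()` names come from the plan.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub enum Type { Int, Float, Str, Bool, Void } // illustrative type set

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct Signature {
    pub params: Vec<Type>,
    pub ret: Type,
}

impl Signature {
    // Canonical, deterministic descriptor, e.g. "(IS)V" (assumed syntax).
    pub fn descriptor(&self) -> String {
        fn code(t: &Type) -> char {
            match t {
                Type::Int => 'I',
                Type::Float => 'F',
                Type::Str => 'S',
                Type::Bool => 'Z',
                Type::Void => 'V',
            }
        }
        let params: String = self.params.iter().map(code).collect();
        format!("({}){}", params, code(&self.ret))
    }
}

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct SigId(u32);

#[derive(Default)]
pub struct SignatureInterner {
    by_sig: HashMap<Signature, SigId>,
    sigs: Vec<Signature>,
}

impl SignatureInterner {
    // Interning makes signature identity a cheap Copy id, usable as a key.
    pub fn intern(&mut self, sig: Signature) -> SigId {
        if let Some(&id) = self.by_sig.get(&sig) {
            return id;
        }
        let id = SigId(self.sigs.len() as u32);
        self.sigs.push(sig.clone());
        self.by_sig.insert(sig, id);
        id
    }
}

fn main() {
    let mut interner = SignatureInterner::default();
    let a = Signature { params: vec![Type::Int], ret: Type::Void };
    let b = Signature { params: vec![Type::Str], ret: Type::Void };
    println!("{} vs {}", a.descriptor(), b.descriptor()); // distinct descriptors
    println!("{:?} {:?}", interner.intern(a), interner.intern(b));
}
```

Descriptor stability comes for free here because the descriptor is a pure function of the `Signature` value, with no ambient state involved.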
**Requirements Checklist**
* [ ] `SigId` is used as identity in compiler IR.
* [ ] Descriptor is stable and round-trippable.
**Completion Tests**
* [ ] `debug(int)->void` and `debug(string)->void` produce different descriptors.
* [ ] Descriptor stability tests.
---
### PR-08 (5 pts) — Replace `name/arity` import/export keys with `(name, SigId)`
**Briefing**
`name/arity` keys and dedup-by-name break overloading and are not industrial-grade.
**Target**
Rewrite import/export identity:
* `ExportKey { module_path, base_name, sig }`
* `ImportKey { dep, module_path, base_name, sig }`
**Scope**
* Update lowering to stop producing `name/arity`.
* Update output builder to stop exporting short names and `name/arity`.
* Update collector to stop dedup-by-name.
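A sketch of the signature-keyed export identity and its duplicate check follows. `ExportKey`'s fields come from the Target above; `SigId` is the interned id from PR-07, and the `ExportTable` shape and error text are assumptions.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct SigId(u32); // interned signature id from PR-07

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct ExportKey {
    pub module_path: String,
    pub base_name: String,
    pub sig: SigId,
}

#[derive(Default)]
pub struct ExportTable {
    exports: HashMap<ExportKey, u32>, // key -> function index
}

impl ExportTable {
    // Duplicate (name, sig) in the same module is a deterministic error;
    // the same name with a different sig is a legitimate overload.
    pub fn add(&mut self, key: ExportKey, func_idx: u32) -> Result<(), String> {
        if self.exports.contains_key(&key) {
            return Err(format!(
                "duplicate export {}::{} {:?}",
                key.module_path, key.base_name, key.sig
            ));
        }
        self.exports.insert(key, func_idx);
        Ok(())
    }
}

fn main() {
    let mut table = ExportTable::default();
    let debug_int = ExportKey { module_path: "core".into(), base_name: "debug".into(), sig: SigId(0) };
    let debug_str = ExportKey { module_path: "core".into(), base_name: "debug".into(), sig: SigId(1) };
    println!("{:?}", table.add(debug_int.clone(), 0)); // Ok
    println!("{:?}", table.add(debug_str, 1));         // Ok: overload, not a duplicate
    println!("{:?}", table.add(debug_int, 2));         // Err: duplicate (name, sig)
}
```

Because the whole key (including `sig`) is hashed, no code path ever needs to build or split a `"{name}/{arity}"` string.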
**Requirements Checklist**
* [ ] No code constructs or parses `"{name}/{arity}"`.
* [ ] Overloading is represented as a first-class concept, not a hack.
**Completion Tests**
* [ ] Cross-module overloading works.
* [ ] Duplicate export of same `(name, sig)` fails deterministically.
---
### PR-09 (3 pts) — Overload resolution rules (explicit, deterministic)
**Briefing**
Once overloading exists, the resolution rules must be explicit.
**Target**
Implement a deterministic overload resolver based on exact type match (no implicit hacks).
**Scope**
* Exact-match resolution only (initially).
* Clear diagnostic when ambiguous or missing.
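Exact-match resolution can be sketched in a few lines. The candidate representation and error names are simplified stand-ins for illustration.

```rust
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Type { Int, Str } // illustrative type set

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    NoMatch { name: String },
    Ambiguous { name: String, count: usize },
}

// Deterministic resolver: exact parameter-type match only, no fallback.
pub fn resolve<'a>(
    name: &str,
    args: &[Type],
    candidates: &'a [(String, Vec<Type>)], // (name, param types)
) -> Result<&'a (String, Vec<Type>), ResolveError> {
    let matches: Vec<_> = candidates
        .iter()
        .filter(|(n, params)| n == name && params.as_slice() == args)
        .collect();
    match matches.len() {
        0 => Err(ResolveError::NoMatch { name: name.to_string() }),
        1 => Ok(matches[0]),
        n => Err(ResolveError::Ambiguous { name: name.to_string(), count: n }),
    }
}

fn main() {
    let candidates = vec![
        ("debug".to_string(), vec![Type::Int]),
        ("debug".to_string(), vec![Type::Str]),
    ];
    println!("{:?}", resolve("debug", &[Type::Int], &candidates));
}
```

With `(name, SigId)` export keys from PR-08, the `Ambiguous` arm should be unreachable in a well-formed module; keeping it as an explicit diagnostic (rather than picking a winner) is what "no best-effort fallback" means.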
**Requirements Checklist**
* [ ] No best-effort fallback.
**Completion Tests**
* [ ] Ambiguous call produces a clear diagnostic.
* [ ] Missing overload produces a clear diagnostic.
---
## Phase 4 — Eliminate Stringly-Typed Protocols & Debug Hacks
### PR-10 (5 pts) — Replace `origin: Option<String>` and all string protocols with structured enums
**Briefing**
String prefixes like `svc:` and `@dep:` are fragile and non-industrial.
**Target**
All origins and external references become typed data.
**Scope**
* Replace string origins with enums.
* Update lowering/collector/output accordingly.
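A structured replacement can be sketched as an enum. The variant names and fields below are assumptions about what the `svc:` and `@dep:` prefixes encode today.

```rust
// Typed origins: matched exhaustively, never string-parsed.
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Origin {
    Local,
    Service { name: String },                        // was "svc:<name>"
    Dependency { dep: String, module_path: String }, // was "@dep:<dep>:<path>"
}

pub fn describe(origin: &Origin) -> String {
    match origin {
        Origin::Local => "local function".to_string(),
        Origin::Service { name } => format!("service {}", name),
        Origin::Dependency { dep, module_path } => {
            format!("dependency {} ({})", dep, module_path)
        }
    }
}

fn main() {
    let o = Origin::Dependency { dep: "stdlib".into(), module_path: "io".into() };
    println!("{}", describe(&o));
}
```

The compiler's exhaustiveness check now enforces what the grep-based lint can only approximate: adding a new origin kind forces every consumer to handle it.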
**Requirements Checklist**
* [ ] No `.starts_with('@')`, `split(':')` protocols.
**Completion Tests**
* [ ] Grep-based test/lint step fails if forbidden patterns exist.
---
### PR-11 (5 pts) — DebugInfo V1: structured function metadata (no `name@offset+len`)
**Briefing**
Encoding debug metadata in strings is unacceptable.
**Target**
Introduce a structured debug info format that stores offset/len as fields.
**Scope**
* Add `DebugFunctionInfo { func_idx, name, code_offset, code_len }`.
* Remove all parsing of `@offset+len`.
* Update orchestrator/linker/emit to use structured debug info.
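The record and its roundtrip can be sketched as follows. The struct fields come from the Scope above; the little-endian wire encoding shown here is purely an assumption for illustration.

```rust
#[derive(Clone, PartialEq, Eq, Debug)]
pub struct DebugFunctionInfo {
    pub func_idx: u32,
    pub name: String,
    pub code_offset: u32,
    pub code_len: u32,
}

impl DebugFunctionInfo {
    // offset/len are real fields on the wire, never packed into the name.
    pub fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.func_idx.to_le_bytes());
        out.extend_from_slice(&(self.name.len() as u32).to_le_bytes());
        out.extend_from_slice(self.name.as_bytes());
        out.extend_from_slice(&self.code_offset.to_le_bytes());
        out.extend_from_slice(&self.code_len.to_le_bytes());
    }

    // Returns the record plus the number of bytes consumed, or None on
    // truncated/invalid input.
    pub fn decode(bytes: &[u8]) -> Option<(DebugFunctionInfo, usize)> {
        let u32_at = |i: usize| -> Option<u32> {
            bytes.get(i..i + 4)?.try_into().ok().map(u32::from_le_bytes)
        };
        let func_idx = u32_at(0)?;
        let name_len = u32_at(4)? as usize;
        let name = std::str::from_utf8(bytes.get(8..8 + name_len)?).ok()?.to_string();
        let code_offset = u32_at(8 + name_len)?;
        let code_len = u32_at(12 + name_len)?;
        Some((DebugFunctionInfo { func_idx, name, code_offset, code_len }, 16 + name_len))
    }
}

fn main() {
    let info = DebugFunctionInfo { func_idx: 7, name: "main".into(), code_offset: 64, code_len: 32 };
    let mut buf = Vec::new();
    info.encode(&mut buf);
    let (back, read) = DebugFunctionInfo::decode(&buf).unwrap();
    println!("{:?} ({} bytes)", back, read);
}
```

Since the name field carries only the name, the completion test "no debug name contains an `@` pattern" becomes a trivial invariant rather than a parsing convention.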
**Requirements Checklist**
* [ ] No code emits or parses `@offset+len`.
**Completion Tests**
* [ ] A test that fails if any debug name contains an `@offset+len` pattern.
* [ ] Debug info roundtrip test.
---
## Phase 5 — Hardening: Diagnostics, Error Handling, and Regression Shields
### PR-12 (3 pts) — Replace panics in critical build pipeline with typed errors + diagnostics
**Briefing**
`unwrap`/`expect` calls in the compiler/linker turn user errors into crashes.
**Target**
Introduce typed errors and surface diagnostics.
**Scope**
* Replace unwraps in:
* symbol resolution
* import/export linking
* entrypoint selection
* Ensure clean error return with context.
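The pattern can be sketched for symbol resolution; the error variants and diagnostic wording are illustrative, not the real crate's error type.

```rust
use std::collections::HashMap;
use std::fmt;

// Hedged sketch of a typed pipeline error replacing unwrap/expect.
#[derive(Debug, PartialEq)]
pub enum LinkError {
    UnresolvedSymbol { module: String, name: String },
    MissingEntrypoint { expected: String },
}

impl fmt::Display for LinkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LinkError::UnresolvedSymbol { module, name } => {
                write!(f, "unresolved symbol `{}` in module `{}`", name, module)
            }
            LinkError::MissingEntrypoint { expected } => {
                write!(f, "no entrypoint `{}` found", expected)
            }
        }
    }
}

impl std::error::Error for LinkError {}

// Before: `symbols.get(name).unwrap()` panics on invalid user programs.
// After: a typed error with context the driver renders as a diagnostic.
pub fn resolve_symbol(
    symbols: &HashMap<String, u32>,
    module: &str,
    name: &str,
) -> Result<u32, LinkError> {
    symbols.get(name).copied().ok_or_else(|| LinkError::UnresolvedSymbol {
        module: module.to_string(),
        name: name.to_string(),
    })
}

fn main() {
    let symbols = HashMap::from([("main".to_string(), 0u32)]);
    match resolve_symbol(&symbols, "app", "missing") {
        Ok(idx) => println!("resolved to {}", idx),
        Err(e) => println!("error: {}", e), // diagnostic, not a panic
    }
}
```

Implementing `std::error::Error` lets the orchestrator propagate these with `?` and attach further context at each layer instead of unwrapping.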
**Requirements Checklist**
* [ ] No panic paths for invalid user programs.
**Completion Tests**
* [ ] Invalid program produces diagnostics, not panic.
---
### PR-13 (3 pts) — Add regression test suite: link-order invariance + opcode-change immunity
**Briefing**
We need a system immune to opcode churn.
**Target**
Add tests that fail if:
* linker steps bytes manually
* decoder/spec drift exists
* link order changes semantics
**Scope**
* Link-order invariance tests.
* Spec coverage tests.
* Optional: lightweight “forbidden patterns” tests.
**Requirements Checklist**
* [ ] Changing an opcode immediate size requires updating only the spec and tests.
**Completion Tests**
* [ ] All new regression tests pass.
---
## Summary of Estimated Cost (Points)
* Phase 1: PR-01 (3) + PR-02 (5) + PR-03 (3) = **11**
* Phase 2: PR-04 (5) + PR-05 (3) + PR-06 (3) = **11**
* Phase 3: PR-07 (5) + PR-08 (5) + PR-09 (3) = **13**
* Phase 4: PR-10 (5) + PR-11 (5) = **10**
* Phase 5: PR-12 (3) + PR-13 (3) = **6**
**Total: 51 points**
> Note: If any PR starts to exceed 5 points in practice, it must be split into smaller PRs.
---
## Non-Negotiables
* No compatibility with legacy encodings.
* No heuristics.
* No string hacks.
* One canonical decoder/spec/layout.
* Everything in English (including review comments).