Skip to content

perf: precisely short-circuit exact forwarders (fmodf et al. libm calls) and zero-forwarders (Motoko-specific)#5961

Draft
ggreif wants to merge 54 commits intomasterfrom
gabor/forward
Draft

perf: precisely short-circuit exact forwarders (fmodf et al. libm calls) and zero-forwarders (Motoko-specific)#5961
ggreif wants to merge 54 commits intomasterfrom
gabor/forward

Conversation

@ggreif
Copy link
Copy Markdown
Contributor

@ggreif ggreif commented Mar 31, 2026

Summary

Short-circuit two classes of Wasm call-indirection in the linker, before RTS imports are resolved:

  • Exact forwarders (libm/RTS side). forwarding_target in src/linking/linkModule.ml detects functions whose body is local.get 0 … local.get n-1; call k — the Rust #[no_mangle] wrappers for fmod, fmodf, pow, sin, cos, …. chase_forwarders rewrites the RTS exports map so each export resolves directly to its terminal callee, bypassing the wrapper. The moc-generated call site then hits $libm::… directly.

  • Zero-forwarders (Motoko-specific). zero_forwarder_target detects Motoko top-level functions whose body is i{32|64}.const 0; local.get 1 … local.get n-1; call k — they forward all arguments through, passing a zero as the closure slot. A new abstract interpreter (src/linking/constTrack.ml{,.mli}, ~495 lines) propagates integer constants over the Wasm operand stack: straight-line code, Block/Loop/If join points, Br/BrIf/BrTable, Select, Compare/Test, Unary/Convert, with local-tracking and FromLocal backprop. At each Call k where k is a 0-forwarder and the closure arg on the stack is i{32|64}.const 0 specifically, the call is rewritten to the chain's terminal. A nonzero constant is deliberately rejected: the forwarder would have overwritten the closure with zero, so eliding it would leak the caller's value to a callee that may inspect $clos.

Forwarder-chain compression (termination + cycle safety)

Both forwarder maps (chase_forwarders on RTS exports and the 0-forwarder map on em1 locals) are compressed via Kleene pointer-jumping with a difference-map output: each step expands every entry's target via the current map; iteration terminates when the diff is empty, or a fuel bound of cardinal raw kicks in (guarding against odd-length cycles where pointer-jumping oscillates rather than converges). A post-filter drops any entry whose compressed target is still a key of the converged map — those sit on or lead into a cycle and have no well-defined terminal. With the map compressed and cycle-free, the call-site rewriter needs one pass and cannot loop.

Nested-block-safe rewrite via physical identity

ConstTrack.process_block_inner recurses into Block/Loop/If bodies with a fresh idx=0, so positional keys would mis-target calls inside nested blocks. Rewrites are therefore keyed on the physical identity of the instr phrase (ConstTrack.same_instr, backed by OCaml ==). Apply walks the body recursively, descending into the three nested forms. Sound under the linker's tree-shaped IR invariant (no phrase sharing, no position aliasing); documented in constTrack.mli.

constTrack.mli is a public interface — the abstract interpreter is reusable by future linker passes.

Tests

test/run/fmodf-forward.mo (FileCheck, classical + EOP + wasm64):

  • fmodf call site reaches $libm::…fmodf… directly.
  • Motoko foo → bar → quux chain collapses: foo calls quux directly (bar is a 0-forwarder).
  • baz (which adds 1.0 after calling quux) is not collapsed — not a pure forwarder.
  • Unreferenced $fmodf wrapper remains in the binary (to be DCE'd by wasm-opt).

test/run-drun/blob-dedup-upgrade.drun under --sanity-checks --enhanced-orthogonal-persistence — this was the regression that motivated the nested-block fix. The sanity-check instrumentation introduces nested Blocks around calls; without physical-identity keying, the rewriter landed on the wrong instruction and produced expected [i64, i64] but got [i64] at load time.

test/run.sh — small portability fixes for macOS 26.4 (tr -d '\000', paste -sd' ' -).

Follow-up

PR #5964 is stacked on this one and handles the case chase_forwarders cannot: call_indirect for capturing closures (foo_clos → bar.1 → quux.1) where the function index is statically known but the closure carries a captured value.

Design notes

  • .claude/plans/abstract-interpreter.md
  • .claude/plans/call-forwarding.md
  • .claude/plans/wasm-exts-update.md

Test plan

  • make -C test/run fmodf-forward.only[tc] [run] [comp] [valid] [FileCheck] [wasm-run] pass (classical, EOP, wasm64).
  • make -C test/run-drun blob-dedup-upgrade.only under --sanity-checks --enhanced-orthogonal-persistence — passes.
  • make -C test/run -j10 quick — all tests pass.

🤖 Generated with Claude Code

@ggreif ggreif requested a review from a team as a code owner March 31, 2026 12:30
@ggreif ggreif self-assigned this Mar 31, 2026
@ggreif ggreif marked this pull request as draft March 31, 2026 12:41
@ggreif ggreif changed the title perf: chase forwarding wrappers in the linker (chase_forwarders) perf: chase forwarding wrappers in the linker and short-circuit them Mar 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 31, 2026

Comparing from bc32e40 to edade2b:
In terms of gas, 5 tests improved and the mean change is -0.5%.
In terms of size, 5 tests regressed and the mean change is +0.0%.

Comment thread test/bench/ok/alloc.drun-run.ok Outdated
ingress Completed: Reply: 0x4449444c016c01b3c4b1f204680100010a00000000000000000101
ingress Completed: Reply: 0x4449444c0000
debug.print: (+671_088_640, 5_522_892_825)
debug.print: (+671_088_640, 5_405_452_313)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instruction count reduced: 5_522_892_825 → 5_405_452_313 (−1.1%)

Comment thread test/bench/ok/heap-32.drun-run.ok Outdated
ingress Completed: Reply: 0x4449444c0000
debug.print: (50_227, +75_320_504, 1_153_792_735)
debug.print: (50_070, +86_221_288, 1_256_407_315)
debug.print: (50_227, +75_320_504, 1_153_792_714)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instruction count reduced: 1_153_792_735 → 1_153_792_714 (−21 instrs)

Comment thread test/bench/ok/heap-64.drun-run.ok Outdated
Comment thread test/bench/ok/palindrome.drun-run.ok Outdated
Comment thread src/linking/linkModule.ml
let rec fixpoint f x = let x' = f x in if x' = x then x else fixpoint f x'

let chase_forwarders (funcs : func array) (types : Wasm_exts.Types.func_type list)
(import_count : int) (exports : exports) : exports =
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too convoluted

Comment thread src/linking/linkModule.ml
let idx = Int32.to_int fi - import_count in
if idx < 0 then None
else
let f = funcs.(idx) in
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is AI-written, needs rewrite to make it nicer.

@ggreif ggreif added the opportunity More optimisation opportunities inside label Mar 31, 2026
@ggreif ggreif changed the base branch from gabor/float32-peephole to master March 31, 2026 19:01
Comment thread test/run/forward-calls.mo
Comment thread test/run/forward-calls.mo
Comment thread .claude/plans/call-forwarding.md
Comment thread .claude/plans/call-forwarding.md Outdated

### Safety

The replacement is safe at any call site that supplies **any `i32.const k`**
Copy link
Copy Markdown
Contributor Author

@ggreif ggreif Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a lie. We need a strict guarantee that const 0 is at the exact right stack depth. See section "Precise call-site eligibility via stack-depth tracking" below!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the Safety paragraph is loose. "Any i32.const k before the call" isn't enough; the const has to be the value on the operand stack at depth n-1 (0-indexed from TOS, with n = callee arity) at the instant the call executes. That is exactly what the "Precise call-site eligibility via stack-depth tracking" section below describes, and what the shipped code checks at each call site via ConstTrack.lookup state (n-1) in the on_call callback.

I'll tighten this subsection so it points at the stack-depth criterion rather than implying a textual/peephole match, and remove the "any i32.const k" framing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned out there was a second, independent bug of the same spirit: the shipped on_call predicate in linkModule.ml (line 1113) matched any ConstTrack.I32 _ | I64 _ at the closure-arg depth, not zero specifically — a regression from 80ae9b7 when the flat-array heuristic (which had I32 0l | I64 0L) was replaced by the abstract interpreter.

A nonzero caller constant would have reached the terminal $k unchanged after elision, which is unsafe if $k inspects $clos (not a linker-provable invariant for an arbitrary callee). Tightened the predicate to Some (ConstTrack.I32 0l | ConstTrack.I64 0L) and rewrote the Safety subsection to match. b87813a.

Tests still pass because moc only emits i32.const 0 for top-level named calls — but the previous code was one malicious caller away from producing wrong Wasm.

Comment thread .claude/plans/call-forwarding.md Outdated
but for let-bound closure values it emits a **static closure object pointer**
(e.g. `i32.const 2097251`). The worker function ignores its received `$clos`
and synthesises its own `i32.const 0` for the callee — so the caller's
constant is irrelevant to the callee's behaviour.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callee might examine the passed in closure and use it when not zero. So short circuiting is only permissible when the call-site really passes a zero.

@ggreif ggreif added the performance Affects only gas usage or code size label Apr 1, 2026
ggreif and others added 9 commits April 18, 2026 20:19
…types

Switch constTrack from Wasm.Ast to Wasm_exts.Ast types (matching linkModule's
namespace). Add diagnostic pass in linkModule.ml that runs the abstract
interpreter over each function body. Fix OCaml name resolution issues:
open Wasm.Source for .it, extract is_zero helper, Select without wildcard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er rewriter

The old call-site rewriter used arr[i - param_count] to check if the closure
arg was Const 0 — unsound when arguments span multiple instructions. Now uses
constTrack's abstract interpreter with proper stack-depth tracking via
ConstTrack.lookup lru (n_params - 1). Adds on_call callback to process_block
so linkModule can inspect the LRU at each call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Process Block bodies recursively, intersect LRU states at If join points
(constants that agree in both branches survive). Loop conservatively flushes.
VarBlockType resolution via optional type_section callback. Unlocks 11 new
files with zero-forwarder rewrites that were previously invisible behind
control flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes redundant local opens and qualified prefixes throughout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…filter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Propagate constants through Select: known condition picks the right
operand, equal operands propagate regardless of condition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BrIf continues on the fall-through path (pop condition, proceed).
Fix Block and If handlers: body depth tracking is authoritative — no
exit shift needed (was double-counting). Evict result slots at join
points conservatively since BrIf-taken paths may carry different values.

Document known pessimisations and Phase 3 plan for precise joins via
OCaml 5.x algebraic effects (May_leave effect + depth decrement at
each Block boundary).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three branch instructions emit May_leave. BrTable is just an iter
over targets — each block collects the same LRU. fold_left intersect
at the join resolves both known pessimisations. Draft OCaml 5.3 PR exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove constTrack eprintf and chase_forwarders eprintf
- Remove unused is_zero helper
- Pass type_section callback so VarBlockType blocks are resolved

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ggreif and others added 4 commits April 19, 2026 02:41
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add section to call-forwarding.md explaining the stack-depth + LRU-zeros-set
approach as a sound replacement for the current flat-array offset heuristic.
Includes per-instruction-class delta table (const/get/set/tee/unary/binary/call),
control-flow handling, and a clarification that this only applies to the
0-forwarder call-site rewriter, not chase_forwarders.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure LRU cache keyed by stack depth tracks integral constants through
straight-line Wasm code. Supports constant folding for i32/i64
arithmetic, continues past `call` with stack delta adjustment, and
dumps to stderr when a known zero is in the LRU at a call site
(potential forwarder elimination candidate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ggreif ggreif changed the title perf: chase forwarding wrappers in the linker and short-circuit them perf: precisely short-circuit exact forwarders (fmodf et al. libm calls) and zero-forwarders (Motoko-specific) Apr 19, 2026
ggreif and others added 5 commits April 19, 2026 03:53
Regression from 80ae9b7 (migration from flat-array heuristic to
ConstTrack): the new `on_call` predicate matched any `I32 _ | I64 _`
at the closure-arg stack depth, where the flat-array code had checked
`I32 0l | I64 0L` specifically.

A 0-forwarder `$foo` overwrites its received closure with `i32.const 0`
before calling `$k`. Eliding `$foo` lets the caller's closure value
reach `$k` unchanged. If the caller had supplied a nonzero constant
and `$k` inspects `$clos` (not a locally-provable invariant for an
arbitrary callee), behaviour diverges. Tightening to zero makes the
rewrite an identity at `$k`'s entry.

Tests still green: moc emits `i32.const 0` for top-level named calls,
which is what the tightened predicate matches.

Also rewrites the Safety subsection in .claude/plans/call-forwarding.md
to document the correct criterion and explain why "any i32.const k" is
unsound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ycles

Without cycle detection, the em1 0-forwarder rewriter could loop forever
when two or more top-level Motoko functions formed a cyclic forwarder
chain — moc would hang at compile time. Reproduced on classical-persistence
compilation of test/run/multi-value-block.mo.

Introduce `compress_forwarder_map`: one-shot Kleene iteration on an
Int32Map of raw one-hop mappings, emitting a difference-map per step for
termination. Bounded at [cardinal raw] iterations; on exit, any entry
whose target is still a key of the converged map sits on (or leads into)
a cycle and is dropped — it has no well-defined terminal, so the rewrite
is unsafe. After compression the rewriter needs at most one pass and
cannot loop.

Applied to both sites:

- `chase_forwarders` (RTS exports side) — builds raw map via
  `forwarding_target`, compresses, single-shot `NameMap.map`. Previously
  used a generic `fixpoint` helper that also had no cycle guard.

- em1 0-forwarder rewriter — builds raw map via `zero_forwarder_target`,
  compresses, then walks each function body once. Rewriter rewritten in
  functional style: `collect_rewrites` returns a difference-map (instr
  index → new target) from `ConstTrack.process_block`; `apply_rewrites`
  applies it via `List.mapi`. No more `rec loop`, no `any_changed`/
  `changed` refs, no in-place array mutation.

Verified green:
- test/run/multi-value-block.mo (was hanging): completes in 1.7s
- test/run/fmodf-forward.mo (regression check): unchanged
- test/run -j8 quick: full suite passes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression from e98cfec: the functional rewriter collected rewrites
keyed by ConstTrack's instruction index and applied them via
`List.mapi` on the top-level body. But `ConstTrack.process_block_inner`
recurses into `Block`/`Loop`/`If` bodies with a fresh `idx=0`, so the
callback receives a block-relative index, not a body-absolute one. A
rewrite fired inside a nested block would land at a top-level position
with that `idx` — the wrong (and typically differently-typed)
instruction — producing Wasm that fails validation at link time.

Observed under `--sanity-checks --enhanced-orthogonal-persistence` in
`test/run-drun/blob-dedup-upgrade`: the inserted sanity-check call at
the top-level shared an index with a 0-forwarder call inside a nested
`block`, so the rewriter wrote the forwarder's target over the sanity
call, yielding `expected [i64, i64] but got [i64]` at load time.

Fix: key rewrites on the `instr` phrase's *physical identity* (OCaml
`==`), which uniquely pins the exact AST node regardless of its depth
in the block tree. Apply by walking the body recursively, descending
into `Block`/`Loop`/`If` bodies. Relies on the linker's Wasm IR being
tree-shaped (no phrase sharing, no position aliasing) — a property
held by moc's IR.

Verified:
- test/run-drun/blob-dedup-upgrade.drun under sanity+EOP (was failing)
- test/run/multi-value-block.mo (was hanging pre-compression)
- test/run/fmodf-forward.mo (no regression)
- manual wasm-validate on the compiled output: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 0-forwarder rewriter keys its rewrite set on OCaml (==) of the
instruction phrase to sidestep ConstTrack's block-relative idx. Lift
that comparison into a named predicate on the abstract-interpreter's
public API so the tree-shape-IR invariant it relies on is documented
in one place rather than inlined at the call site.

OCaml inlines the definition, so no runtime cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented Apr 20, 2026

Independent review summary (soundness pass)

Dispatched an independent review agent over the three fix commits (b87813adb, e98cfeccb, 8e2f8dc8e, plus the follow-up refactor 8a999ffec). Gist of its findings:

Sound — nothing wrong

  • Zero-specific closure predicate (b87813adb): tight. The forwarder overwrites $clos with 0 before calling its callee; only a caller-supplied 0 preserves semantics at the terminal's entry. Matching I32 0l | I64 0L is correct; "any i32.const k" would leak nonzero values to a callee that may inspect $clos.

  • Path compression with fuel + cycle drop (e98cfeccb): termination verified explicitly for length-2 and length-3 cycles. Length-2 collapses to self-loops in one step (both still in dom(m')); length-3 oscillates permutationally with period 3, so by k=cardinal all cycle entries still point into dom(m'). The post-filter not (Int32Map.mem t m') correctly drops both cyclic nodes and tail nodes leading into a cycle, since after Kleene convergence any non-terminal-reaching entry still has m'[fi] ∈ dom(m').

  • Physical-identity rewrite (8e2f8dc8e, refactored in 8a999ffec): the reasoning is right — process_block_inner recurses with fresh idx=0, so a positional key would collide across nesting levels. Since moc's IR is tree-shaped (no phrase sharing), == is a correct key. The rewrite_instr descent mirrors the three recursive cases in ConstTrack.step.

Fragile — flagged, not errors

  • Fuel bound cardinal raw is conservative; ⌈log₂ cardinal⌉ would be tighter. Non-issue at the expected scale (<100 entries).
  • == soundness depends on the non-local "no phrase sharing" invariant. Now documented via ConstTrack.same_instr so any future optimisation that introduces memoisation of phrases would trip over the documented assumption.
  • on_call doesn't re-check n_params ≥ 1; safe today via zero_forwarder_target's own guard at linkModule.ml:234, but the coupling is implicit.
  • Cross-width closures (i32/i64 mix under the wasm64 transition) — safe provided type-checking agrees on width, which upstream enforces.

Alternatives considered, not taken

  • Tarjan SCC for exact cycle detection — more defensible than fuel, but fuel is cheaper and sufficient here.
  • Pushing nested-block awareness into ConstTrack's API (e.g., absolute-path key) — addressed instead by exposing ConstTrack.same_instr as a named helper so the identity invariant is documented in one place.
  • chase_forwarders (dm2/RTS exports) and the em1 rewriter operate on disjoint function arrays and disjoint maps; ordering is independent.

Bottom line

No correctness issues found across the four commits.

ggreif added a commit that referenced this pull request Apr 22, 2026
Moves the plan file here from PR #5961's branch where it sat as a
leftover companion doc. Outlines the forthcoming `wasm-exts` sync
work — pulling upstream instruction support forward so codegen can
emit SIMD, newer ref-types, etc. This PR is the natural home.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moved to PR #6043 (wasm-exts sync), which is the natural home for the
upcoming instruction-catchup work. No content loss — file is
reproduced verbatim on gabor/wasm-exts-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 6 commits April 22, 2026 14:45
Replaces the Phase 2 `has_br_if` eviction heuristic with a precise join
that accumulates branch-target LRU states via OCaml 5.3 algebraic
effects.

Each `Br`/`BrIf` performs `May_leave (n, lru)` carrying the branch's
label depth and outgoing LRU state. The nearest enclosing `Block`/`If`
handler catches `(0, lru)` into a `branch_states` list; `(n>0, lru)`
is re-raised as `(n-1, lru)` so effects propagate de-Bruijn-style
through nested label-pushing constructs. `Loop` swallows `(0, _)`
(back-edges; no fixed-point iteration) and re-raises higher depths.

The join unifies fall-through and explicit `Br`-to-End: after body
processing, any `Some lru'` return is normalised as a synthetic
`May_leave (0, lru')` so it joins `branch_states` symmetrically.
The final state is `List.fold_left intersect` over the list —
precision now matches the Wasm semantics rather than the conservative
"evict result slots whenever a BrIf was seen" of Phase 2.

`If` is structurally identical to `Block` after this refactor — both
legs feed the same `branch_states` via the same handler, no special-
casing of then/else fall-through needed. `process_block` installs a
defensive top-level handler that swallows any `May_leave` leaking
past the outermost block.

Eliminates the documented pessimisations:
- BrIf-less blocks with constant results no longer lose them
- Blocks where all branch paths agree on result values preserve them

Build inside `nix develop` (OCaml 5.3 required for `Effect`).
`test/run/fmodf-forward` (FileCheck) and `capture-mut` (runtime) pass
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tates

The three label-pushing constructs (Block, If, Loop) had copies of the
same effect-handler skeleton — catch (0, lru), catch (n>0, _) and
re-raise as (n-1, _). Parametrise the depth-0 action as on_zero and
the fall-through-normalisation as a boolean, so:

  - Block  : ~on_zero:(collect into branch_states)  ~normalise:true
  - If     : (same on_zero for both legs)            ~normalise:true
  - Loop   : ~on_zero:(fun _ -> ())                  ~normalise:false

The fold-intersect of the collected branch states is extracted as
`join_branch_states`. If's two legs become two invocations of the
same runner — the structural symmetry with Block is now explicit in
the code, not just in the comment.

Net: -72 lines, -15 nested record literals. Behaviour unchanged;
test/run/fmodf-forward and capture-mut still pass identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a linker-level regression test demonstrating that Phase 3 of
ConstTrack (precise branch joins via algebraic effects) is what gates
the zero-forwarder call-site rewrite in `collect_rewrites`.

$bar is a syntactic zero-forwarder — body is `i32.const 0; local.get 1;
call $quux` so `zero_forwarder_target` returns `Some $quux`.

$foo calls $bar with a closure arg produced by a Block whose body
contains a BrIf and whose fall-through + branch-taken paths both push
`i32.const 0`. Phase 2 of ConstTrack evicted the Block's result-slot
entries whenever a BrIf was seen, so the `0` at depth 1 at the
`call $bar` site was lost and `collect_rewrites` wouldn't fire.
Phase 3 takes the intersection of fall-through with every Br-target
LRU, so the shared `i32.const 0` survives and the linker rewrites
$foo's `call $bar` to `call $quux`.

Verified by temporarily reverting to pre-Phase-3 ConstTrack and
confirming the .ok file diff fires with `call $bar` instead of
`call $quux`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the four precision-gain examples the review agent flagged:

  $nested_block      — reviewer ex. 1, nested Block de-Bruijn depth:
                       Br 1 from inner (with BrIf) + Br 1 from
                       fall-through, both carrying `i32.const 0` to
                       the outer Block's End.

  $block_over_loop   — reviewer ex. 2, Block { Loop { Br 1 } }:
                       back-edge swallowed, Br 1 decremented through
                       Loop handler, outer Block catches the 0.

  $agreeing_brif     — reviewer ex. 3, canonical agreeing-BrIf:
                       BrIf-taken and fall-through paths both push
                       0; Phase 3 intersects, keeps it.

  $agreeing_if_legs  — reviewer ex. 4, If with agreeing legs (one
                       has a BrIf that targets the function label):
                       shared branch_states collects both legs'
                       normalised fall-throughs, intersect keeps 0.

Verified each case is a strict Phase-3 regression detector by
temporarily reverting to pre-Phase-3 constTrack.ml: all four
functions' `call $bar` sites drop back to unrewritten (no call to
$quux), producing a four-way diff against the committed .ok. Under
Phase 3, every call site gets rewritten and the .ok matches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented Apr 22, 2026

Phase 3 (constTrack) just landed on this branch

Three commits implement the precise branch-join design sketched in .claude/plans/abstract-interpreter.md § "Phase 3: Precise branch joins via algebraic effects", plus a fourth with a linker-level regression test:

  • 7d815c4d1 feat(constTrack): Phase 3 — precise branch joins via algebraic effects
  • 14c04cd6b refactor(constTrack): extract run_under_leave_handler + join_branch_states
  • e96295b56 test(ld): zero-forwarder regression test exercising Phase 3 precision
  • c58d0f6f6 test(ld): expand zero-fwd-join to 4 Phase-3 regression cases

Review-agent verdict

An independent reviewer with an effects / SSA / Wasm brief walked the two constTrack commits, traced two nested cases step-by-step (Block { Block { If { Br 2 } } } and Block { Loop { Br 1 } }) through the handler chain, and reported:

  • No correctness bugs. Effect semantics hold: Effect.perform inside an effc function re-dispatches to the next outer handler (OCaml 5.3 built-in), so the de-Bruijn decrement chain (n → n-1 → ... → 0) works as intended. Every handler invokes continue k () exactly once per branch; no continuation is dropped or double-invoked.
  • No soundness regression vs Phase 2. Every scenario considered is at least as precise as Phase 2 and strictly better in three: (1) no-BrIf blocks preserve constants the has_br_if hack left alone; (2) BrIf-taken and fall-through paths that agree on a result slot now retain the constant where Phase 2 bulk-evicted; (3) unconditional Br 0 in a Block now joins only the Br-taken LRU, where Phase 2 cleared entries. BrIf's sense=false refinement remains applied to fall-through only — semantically correct and unchanged.
  • Refactor is behaviour-preserving. The three inlined handlers had identical n>0 re-raise arms; extracting run_under_leave_handler changes nothing semantically. The shared branch_states ref across If's two legs works because the two run calls evaluate sequentially; join_branch_states is order-independent (intersect is commutative/associative). normalise:false + swallowing on_zero recovers the Loop variant verbatim.

Regression detector

The reviewer's four suggested test cases live in test/ld/zero-fwd-join.wat as four exported call sites:

Reviewer's example Function name in test
Nested-Block de-Bruijn depth $nested_block
Block { Loop { Br 1 } } $block_over_loop
Agreeing BrIf (canonical) $agreeing_brif
If with agreeing legs $agreeing_if_legs

Each function ends with call $bar where $bar is a syntactic zero-forwarder (body: i32.const 0; local.get 1; call $quux). Phase 3 sees the i32.const 0 at depth n_params - 1 through each pattern and lets collect_rewrites rewrite every call $barcall $quux. Verified by temporarily reverting constTrack.ml to the pre-Phase-3 commit: all four call sites regress to call $bar (unrewritten), producing a four-way diff against the committed .ok. Under Phase 3 the .ok matches verbatim.

Still unblocked from the plan doc

Phase 3 was gated on "OCaml 5.3 migration (draft PR exists)" — that's now f1e676ae0 on master. This branch merged master in 7771c1f9a, so Phase 3 could finally land. Any future revert of Phase 3 is caught by the test/ld/zero-fwd-join diff.

…leanups

Small polish pass after Phase 3 landed:

  - Drop redundant `Wasm_exts.Ast.` qualifier on `same_instr`'s param
    types — `instr` is already in scope via `open Wasm_exts.Ast`.

  - Replace intermediate-named bindings (`inner_lru`, `lru_cond`,
    `entry_lru`) with shadowed `lru` where the original is not needed
    after derivation. Applies to Block, If, and Loop. The handler
    callbacks' argument is renamed from `lru_b` to `lru` (shadowing
    the enclosing `lru` only within the callback body).

  - Collapse Loop further: the inner view, the state passed to the
    body, and the exit state are all equal to the flushed lru (empty
    entries + empty locals make depth shifts vacuous). Drop the
    `n_params`/`n_results` bindings and the trailing shift.

  - Pipe `perform (May_leave ...)` as `May_leave ... |> Effect.perform`
    in all four sites — reads left-to-right like a "send this effect"
    rather than a function call on a constructor.

Pure syntactic / naming cleanup; no semantic change. Verified:
test/ld/zero-fwd-join (4 Phase-3 regression cases) and test/run/
fmodf-forward (FileCheck on emitted Wasm) both pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the placeholder `BrTable _ -> None` arm with the full
semantics: pop the index, then emit one `May_leave (label_depth t, _)`
per target (listed targets + default), then return `None` like any
other terminator. Each target's enclosing handler catches via the
same de-Bruijn decrement chain as `Br`/`BrIf`.

This plugs the last structural gap in Phase 3's control-flow coverage:
before this, any block whose only exit was a `br_table` would return
`None` with no contributions, so the Block's branch_states stayed
empty and the outer join collapsed to None — the analyser "couldn't
see past" the BrTable.

Fifth case `$br_table_agree` added to test/ld/zero-fwd-join.wat: a
Block whose body is `i32.const 0; local.get 0; br_table 0 0 0 0`
(three listed targets + default, all targeting the Block). Every
target collects `[d0=I32 0]`; intersect keeps the 0; the call site
gets rewritten to `$quux`. Verified as a strict regression detector
by temporarily reverting only the BrTable arm — produces exactly one
`call $quux` → `call $bar` diff in the .ok file (the other four cases
unaffected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented Apr 22, 2026

Follow-up — BrTable now modelled

The earlier comment left BrTable _ -> None as "not yet modelled". Just pushed 6f2d94c2d feat(constTrack): model BrTable + 5th regression case that plugs the gap:

| BrTable (targets, default) ->
  let lru = shift_and_evict (-1) lru in  (* pop the index *)
  List.iter
    (fun t -> May_leave (label_depth t, lru) |> Effect.perform)
    (targets @ [default]);
  None

Each target — listed plus default — receives one May_leave carrying the popped-state LRU. The enclosing handler chain does the de-Bruijn decrement exactly as for Br/BrIf.

5th regression case in test/ld/zero-fwd-join.wat: $br_table_agree

(func $br_table_agree (type $t_i)
  (block (result i32)
    i32.const 0
    local.get 0
    br_table 0 0 0 0        ;; three listed + default, all target this block
  )
  local.get 0
  call $bar)

Every target contributes [d0 = I32 0] to the Block's branch_states; the fold-intersect keeps 0; the linker rewrites call $barcall $quux.

Verified as a strict regression detector by temporarily reverting only the BrTable arm of the step function while keeping the rest of Phase 3 intact: the diff shows exactly one call $quuxcall $bar regression (the new 5th function); the other four cases remain rewritten. So this commit is load-bearing just for the BrTable precision, independent of the earlier Phase 3 work.

Review-agent verdict (focused pass on the BrTable arm)

A follow-up review across five angles — Wasm semantics, per-target stack shape, duplicate targets, de-Bruijn chain through disjoint-depth targets, and the None-terminator/iteration-order details — came back with no bugs found:

  • Wasm semantics match: shift_and_evict (-1) pops the index; each possible successor gets the same post-pop state. No index-range reasoning needed — out-of-range → default is already covered by the list iteration, and must-converge intersect is over-approximating by construction (contributing to every syntactically-reachable target is sound since runtime picks exactly one).
  • Homogeneous stack shape: Wasm validation forces identical label arity across all targets + default, so sending the same lru to every target cannot produce a type-shape mismatch at the intersect site.
  • Duplicate targets harmless: intersect is idempotent on equal inputs; the 5th test's br_table 0 0 0 0 proves this end-to-end.
  • Disjoint-depth trace: walked block_outer { block_mid { block_inner { br_table 0 1 2 } } } explicitly — block_inner catches the depth-0 contribution directly; the depth-1 re-raises through block_inner to block_mid; the depth-2 re-raises through block_inner and block_mid to block_outer. Each enclosing Block receives exactly one contribution at its own level.
  • None return + targets @ [default] ordering: both correct; intersect is commutative/associative so concat order is immaterial; the top-level process_block handler swallows any May_leave escaping from a function-scope br_table.

Verdict reproduced verbatim: "sound, complete for the control-flow lattice, and consistent with the Br/BrIf arms. Arguably the minimal correct implementation given the existing Phase 3 framework."

Updated case table

Reviewer example Function name
Nested-Block de-Bruijn depth $nested_block
Block { Loop { Br 1 } } $block_over_loop
Agreeing BrIf (canonical) $agreeing_brif
If with agreeing legs $agreeing_if_legs
BrTable, all targets agree $br_table_agree

All five call sites get rewritten under Phase 3 + BrTable; each regresses to call $bar when the corresponding feature is reverted.

ggreif and others added 3 commits April 22, 2026 19:26
Updates only the footer sections:

- "Phase 3 readiness" heading renamed to note it shipped; clarifies
  no compose issues surfaced from structural equality on const_val
  because handlers exchange lru values directly.

- Open Question on block/loop/if resolution updated with the
  landing commits (7d815c4, 14c04cd, 6f2d94c) and a one-
  paragraph summary of the algebraic-effect mechanism.

- New "Implementation Status" section at the very bottom
  enumerating Phase 1/2/3 coverage, the regression tests in
  test/ld/zero-fwd-join.wat, and the reviewer passes.

Body of the plan (design sketch, code snippets) untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comment #5961 (comment)
("Possibly a `Return`.").

moc occasionally emits an explicit `return` after a tail call; the
previous pattern only matched `[Call k]` and missed those bodies.
Extending the final-segment pattern to `[Call k] | [Call k; Return]`
picks up the variant without loosening the invariant that no other
instructions may follow the call (addressing the sibling comment
#5961 (comment)).

The zero-fwd-join test now has a second forwarder `$bar_ret` with the
trailing-Return body, and `$agreeing_brif` (case 3) is re-targeted to
call it. Verified strict detection: reverting only the `[Call k;
Return]` arm flips exactly `$agreeing_brif`'s call site back to
`call $bar_ret` (unrewritten); the other four cases still rewrite via
the non-Return arm through $bar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comment
#5961 (comment)

The test grew beyond fmodf specifically (it covers closure-capture
forwarders, Motoko-level forwarding chains, and non-forwarders like
`baz`), so `forward-calls` is the more accurate name. Renames the
.mo and both .ok files; no internal references to update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

opportunity More optimisation opportunities inside performance Affects only gas usage or code size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant