
experiment: implement masked br_table dispatch for variant switches (n ≥ 4 arms) #5927

Draft
ggreif wants to merge 41 commits into master from gabor/variant-switch

Conversation


@ggreif ggreif commented Mar 21, 2026

Summary

  • Replaces the O(n) linear comparison chain for variant switches with O(1) br_table dispatch when there are 4 or more arms
  • At compile time, finds a bitmask M (and shift S = ctz(M)) such that (hash_i & M) >> S are all distinct, then emits i32.and M; [i32.shr_u S]; br_table
  • Mask-finding uses Gosper's hack to iterate candidates in order of popcount and value, ensuring compact (low-index) masks are tried first
  • Threshold: max(64, 4n) table size; falls back to linear chain if exceeded
  • Strict win for all n ≥ 4 cases (n = 1 and 2 already handled by single_case/simplify_cases)
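The mask search described above can be sketched compactly. This is an illustration in Python, not the compiler's OCaml code: the names `gosper_next` and `find_variant_mask` mirror the helpers listed under Key files, but the exact table-size accounting and iteration bounds here are assumptions.

```python
def gosper_next(m: int) -> int:
    # Gosper's hack: the next-larger integer with the same popcount.
    c = m & -m                  # lowest set bit
    r = m + c                   # ripple the carry upward
    return (((r ^ m) >> 2) // c) | r

def find_variant_mask(hashes, max_table=None):
    """Smallest mask M (tried in order of popcount, then value) such that
    (h & M) >> ctz(M) is injective on `hashes` and the compact table fits
    within the max(64, 4n) threshold from the summary."""
    n = len(hashes)
    if max_table is None:
        max_table = max(64, 4 * n)
    for k in range(max(1, (n - 1).bit_length()), 32):
        m = (1 << k) - 1                        # smallest k-bit mask
        while m < (1 << 32):
            shift = (m & -m).bit_length() - 1   # ctz(m)
            slots = [(h & m) >> shift for h in hashes]
            if (m >> shift) + 1 <= max_table and len(set(slots)) == n:
                return m, shift
            m = gosper_next(m)
    return None
```

Trying masks in popcount-then-value order means compact, low-index masks are hit first, which keeps the br_table small.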

Benchmark (vs moc 1.3.0, test/bench/variant-switch.mo)

Baseline: PATH=/nix/store/71w3w2df8xv4x56dkff6sl5yfwd01ccc-moc/bin:$PATH

| Workload | moc 1.3 | this branch | speedup |
|---|---|---|---|
| 9-arm switch loop ×10 000 (go) | 137,590,321 | 95,690,321 | 1.44× |
| AST eval fib(7) ×100 | 24,492,248 | 22,034,248 | 1.11× |
| AST→FT transform ×100 | 1,189,148 | 1,057,948 | 1.12× |
| FT eval fib(7) ×100 | 21,519,148 | 21,519,148 | 1.00× (noise floor; no variant dispatch) |

The FT row is unchanged by design: the finally-tagless form has no Expr variant dispatch in its hot loop, confirming the other speedups are real and attributable to the switch optimisation.

Test plan

  • make -C test/run variant_switch.only — passes (4-arm Color, 7-arm Weekday, 4-arm Shape with payloads)
  • make -C test/run variants.only — existing variant tests pass (no regressions)
  • Inspect WAT output: br_table with i32.and emitted; i32.shr_u emitted when S > 0 (confirmed for Weekday: mask 0x15000, shift 12)

Key files

  • src/codegen/compile_classical.ml — helpers bits_needed, iter_masks_with_popcount (Gosper's hack), is_injective, compact_table_size, find_variant_mask; new SwitchE case
  • test/run/variant_switch.mo — new test
  • .claude/plans/variant-switch-br-table.md — design plan (includes future work: same-body arm merging, or-pattern handling)

TODOs

  • effective branches (provably same result)
  • distill not from IR, but from the (^^^) EDSL (maybe change to finally-tagless?)

🤖 Generated with Claude Code

ggreif and others added 4 commits March 21, 2026 02:23
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
At compile time, find a bitmask M (and shift S = ctz(M)) such that
(hash_i & M) >> S are all distinct for the n variant tags. Then emit:

    local.get $tag_field
    i32.const M          ;; compile-time constant
    i32.and
    i32.const S          ;; omitted when S = 0
    i32.shr_u            ;; omitted when S = 0
    br_table ...         ;; O(1) dispatch

compared to the previous O(n) linear comparison chain. The break-even
is at n = 3 (worst case and average coincide), but n = 1 and n = 2 are
already handled by single_case / simplify_cases, so the new path is
a strict win for every applicable case (n ≥ 4 with all TagP arms).

Mask-finding uses Gosper's hack to iterate candidate masks in order
of increasing value, ensuring compact (low-index) masks are tried
first and table sizes remain small. Threshold: max(64, 4n).

Also: add test/run/variant_switch.mo covering 4-arm, 7-arm and
payload-carrying variant switches; add "same-body arm merging" to
the plan as a future optimisation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif requested a review from a team as a code owner March 21, 2026 02:13
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif changed the title Implement masked br_table dispatch for variant switches (n ≥ 4 arms) Mar 21, 2026
ggreif and others added 8 commits March 21, 2026 03:27
…d.ml)

Same optimisation as the classical backend: n ≥ 4 all-TagP variant
switches get masked br_table dispatch instead of a linear comparison
chain.  The EOP backend uses int64 hashes throughout, so the helpers
use Int64 arithmetic.  All bench tests now pass (instruction counts
updated to reflect the savings).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
br_table always expects an i32 operand; the EOP backend operates on
i64 values, so add an i64.to_i32 (WrapI64) conversion after the
masked/shifted tag before the br_table instruction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…imisation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmarks the masked br_table dispatch by traversing a synthetic
~700-node expression tree (constructors: Var, Lit, App, Lam, Let,
LetRec, Case, Con) 10_000 times and reporting instruction counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in compile_enhanced.ml's masked br_table dispatch for EOP:

1. iter_masks_with_popcount iterated all 64-bit k-bit patterns.  EOP
   variant hashes are extend_i32_u values (bits 0-31 only), so masks
   with bits ≥ 32 are useless.  For k≥4 this blew up: C(64,4)=635k
   vs C(32,4)=36k, causing the compiler to hang.  Fix: cap the loop at
   k > 32 / mask ≥ 2^32, matching the classical int32 backend.

2. Nat64.of_int64 mask crashed when mask was negative (bit 63 set).
   Gosper's hack can produce such masks before the new early-exit
   terminates.  Fix: replace with a local ctz64 that works for any
   non-zero int64 by isolating the lowest set bit via a shift loop.
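The ctz replacement in point 2 can be sketched as follows (a Python stand-in for the OCaml helper; the explicit 64-bit masking simulates how a negative int64 bit pattern is viewed as unsigned):

```python
def ctz64(m: int) -> int:
    """Count trailing zeros of a non-zero 64-bit value by shifting the
    lowest set bit down to position 0 -- works even when bit 63 is set,
    i.e. when the value would be a negative int64."""
    m &= (1 << 64) - 1          # view the bit pattern as unsigned 64-bit
    assert m != 0, "ctz64 is undefined for zero"
    n = 0
    while m & 1 == 0:
        m >>= 1
        n += 1
    return n
```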

Also adds test/bench/variant-switch.mo: GHC-Core-like 8-arm expression
interpreter bench (Var/Lit/App/Lam/Let/LetRec/Case/Con) exercising the
hot-path switch dispatch at 10k iterations over a 24-node tree.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e s)

Cases 0 and 1 previously returned leaf nodes (#Var "x", #Lit d) ignoring
the sub-tree s, causing the tree to reset every 8 levels and stay at ≤24
nodes regardless of depth.  Replace with #App(#Var "x", s) and #Lam("k", s)
so every level wraps s; #Case at d%8=6 doubles s, giving exponential growth.

build 15 now produces 80 nodes (800k total/10k iterations, ~70M instructions)
vs the old 24-node cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add 2^10 iteration cutoff to iter_masks_with_popcount in both backends
  to prevent O(C(32,k)) compile-time blowup on switches where no injective
  mask exists (e.g. nested-pattern switches with duplicate outer labels)
- Add distinct-labels guard to the SwitchE br_table branch: only fire when
  all outer TagP labels are unique, as the known_tag_pat arm codes (outer
  tag check stripped) are only correct for flat variant dispatch
- Document both issues and the deeper fix (None fallback should fall through
  to regular handler) in the plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r dispatch)

Move find_variant_mask into the `when` guard in both backends so that a None
result (cutoff reached, no valid mask, or duplicate labels) causes the guard
to fail and OCaml falls through to the regular SwitchE handler with full
patterns.  This eliminates:
- The broken None branch that used known_tag_pat arms (no outer tag check),
  which caused incorrect dispatch (e.g. debug_show on 12-arm Action_ type
  routed #RegisterKnownNeuron to #AddOrRemoveNodeProvider)
- The distinct-labels workaround (now fully subsumed: duplicate labels make
  is_injective fail for every mask, so find_variant_mask returns None)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif changed the title Implement masked br_table dispatch for variant switches (n ≥ 4 arms) experiment: implement masked br_table dispatch for variant switches (n ≥ 4 arms) Mar 21, 2026

github-actions Bot commented Mar 21, 2026

Comparing from bc32e40 to 704169d:
In terms of gas, 1 tests improved and the mean change is -0.0%.
In terms of size, 1 tests improved and the mean change is -0.0%.

ggreif and others added 6 commits March 21, 2026 11:07
…-31 masks

- Raise Gosper iteration cutoff from 2^10 to 2^16 in both backends,
  enabling larger variant types (e.g. 12-arm NNS Action_ type) to find
  a compact mask where the 2^10 limit would time out
- Change classical backend loop guard from `!m <> 0l` to `!m > 0l`:
  stops at zero (wrapped past 2^32) AND at negative int32 (bit 31 set).
  Motoko hashes are 31-bit (Mo_types.Hash.hash always clears bit 31),
  so masks with bit 31 set are irrelevant and Nat32.of_int32 would crash
  on them with Invalid_argument("value out of bounds")
- EOP backend already caps at 0x1_0000_0000 (32-bit range); 31-bit cap
  is implicitly safe there since hashes fit in 31 bits
- Document resolution of plan item 3 (31-bit vs 32-bit hashes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document a future optimisation replacing the single Gosper stream with
concurrent strategy generators (MaskShift batched by bit-window, ModPrime,
RotLow) merged round-robin and ranked by cycle-cost estimate.  This avoids
any single strategy's worst case dominating compile time, and subsumes the
Pre-shortening section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 9th variant `#Prim : Char` for primitive operations (needed for
upcoming `fib` benchmark). Wrapped as `#App (#Prim '+', s)` in `build`
so the recursive sub-tree is preserved and node count stays ~82.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Encodes fib over Peano naturals using #LetRec/#Lam/#Case/#Con/#Prim.
Currently unused (_fibCore); will serve as the benchmark program for
an upcoming eval function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids computing #App (#Prim '-', #Var "n") twice in the recursive arm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif self-assigned this Mar 23, 2026
@ggreif ggreif marked this pull request as draft March 24, 2026 09:20
ggreif and others added 3 commits March 24, 2026 10:36
Also removes _ prefix from fibCore now that it is used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…both

Adds:
- Val/Env runtime types (Peano naturals via #VCon)
- Direct AST interpreter: eval : (Expr, Env) -> Val
- Finally-tagless machinery: FT = Env -> Val, Symantics record, transform
- evalSem: the evaluating Symantics (record of closures, no variant dispatch)
- evalBench: runs fib(7) 100x via eval and via FT, reports instruction counts

Result: fib(7)=13 correct for both; FT ~5% fewer instructions than direct eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmarks the `transform(evalSem, fibCore)` step (100 iterations) and
verifies correctness via `fib7_xform`. Updates `.ok` with new instruction
counts including `instr_transform`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ggreif ggreif added the performance Affects only gas usage or code size label Apr 1, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 20, 2026
## Summary

Adds `test/bench/variant-switch.mo` — a small GHC-Core-like interpreter
that exercises 9-arm variant-switch dispatch in several shapes. Serves
as a reference point (baseline on `master`) for the
dispatch-optimisation work on #5927 (masked `br_table`) and any
follow-ups.

Actor methods:
- `go` — top-level `size tree + size fibCore`, ×10k.
- `evalBench` — `fib(7)` via direct AST eval, compiled finally-tagless
form, and the AST→FT transform itself; ×100 each.
- `weekdayBench` — `isWeekend` (7 explicit arms) vs `isWeekendOr` (same
dispatch via `or`-patterns); ×10k over a 7-arm `Weekday` variant.
- `getPerfData` — reports `rts_lifetime_instructions`.

## Master baseline

| Metric | Instructions |
|---|---|
| `size tree + fibCore` (×10k) | 137,590,321 |
| `eval fib(7)` AST (×100) | 24,509,348 |
| `eval fib(7)` FT (×100) | 21,536,248 |
| AST→FT transform (×100) | 1,189,148 |
| `isWeekend` (×10k × 7 arms) | 10,010,321 |
| `isWeekendOr` (×10k × 7 arms) | 11,050,321 |

With #5927 applied, the explicit-arm dispatch numbers drop materially;
or-patterns currently don't benefit (separate follow-up).

## Test plan

- [x] `make -C test/bench variant-switch.only` passes on `master`.
- [x] drun output captured in
`test/bench/ok/variant-switch.drun-run.ok`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 10 commits April 22, 2026 23:19
Mirrors isWeekend exactly but collapses the 5 weekday cases into one
or-pattern arm and the 2 weekend cases into another. Exercises the
same-body arm-merging path noted in 9d7c498.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 (IR-level, `SwitchE` with `TagP` arms) is blind to or-patterns: the
semantically-identical `isWeekday` (7 flat arms) and `isWeekdayOr`
(2-arm or-pattern) end up with different IR shapes, so only the
former hits the `br_table` dispatch path.

Revise "Where to Apply" to introduce Option C: recognise the
tag-hash-compare fragments at the `patternCode` EDSL level, where
both shapes have already elaborated to the same `(^^^)` chain.
Architecture is three handlers using OCaml 5.3 algebraic effects
(same machinery as ConstTrack Phase 3):

  - Recognizer: perform `Variant_arm {hash; body_code}` at each
    tag-compare site.
  - Strategy query: at `compile_switch` entry, collect the effect
    trace and perform `Dispatch_strategy tag_set` to let the
    enclosing handler choose a plan (MaskShift / ModPrime / RotLow
    / Linear).
  - Emit: dispatch on the returned plan; each emitter is
    independently unit-testable.

Annotate the "Same-body arm merging" section to note it becomes
automatic under V2 — distinct arms with identical bodies naturally
collide on effect-payload equality.

V1 stays in place; V2 lands first in `compile_enhanced.ml` (where
`patternCode` lives), then backports to classical if the
architecture proves out.
Three refinements on top of the previous V2 reframing:

1. Handler ↔ Recognizer become distinct roles, not a 3-step pipeline:
   - Handler sees the IR dispatch node + its type; knows which
     strategies are meaningful for this decision shape.
   - Recognizer lives inside the matching EDSL and sees the fully-
     elaborated "test and branch" fragments; performs
     `Match_decision { token_set; scrutinee_repr; type_info }`.

2. The protocol is explicitly generic — not variant-specific. The
   handler interprets `token_set` (tag hashes, literal immediates,
   nominal IDs, …); `scrutinee_repr` abstracts how to obtain the
   discriminating value. Future applications flagged: AND-patterns
   (where the handler can short-circuit components already matched
   by an outer context) and literal-match chains.

3. V2 launch scope narrowed to Gosper-based MaskShift only. The
   multi-strategy batched search stays listed as future work; the
   protocol is forward-compatible.

Success criterion: or-pattern switches must compile to byte-identical
Wasm as their hand-expanded flat-arm equivalents (same mask, shift,
table, arm blocks modulo label numbering). FileCheck test pinning
this equivalence is the V2 acceptance gate.
…ing)

Earlier drafts left ambiguous whether the recognizer walks an
elaborated EDSL tree or fires effects during emission. The EDSL's
value type is opaque — `patternCode = CannotFail of G.t | CanFail of
(G.t -> G.t)` — so no walkable AST survives composition. Clarify:

  - Recognizer fires `Match_decision` effects *during* the procedural
    emitting-combinator calls (`fill_pat`, `compile_pat_local`, …),
    not after.
  - `(^^^)`, `orElse`, `orsPatternFailure` stay pure G.t manipulation.
    Effects attach at leaf combinators that know what's being
    discriminated (TagP, AltP-over-TagPs, later LitP).
  - `body_compiler` is a thunk the handler chooses to invoke or not,
    giving it *control over emission* rather than just strategy
    selection. This is what makes future AND-patterns (where a
    component may already be known from an outer context) a natural
    extension: the handler returns No_op and suppresses emission.

List the concrete perform-sites for V2: `TagP`, `AltP` bottoming out
in `TagP` (the or-pattern fold), and the future `LitP` extension.
An or-pattern over tag constructors — e.g. `(#mon | #tue | ... | #fri)
false` — was previously an opaque single arm from the V1 br_table
guard's perspective, so `isWeekdayOr` did not get the `br_table`
treatment that the structurally-identical `isWeekday` (7 flat arms)
received. Extend the guard to recognise `AltP` chains bottoming out
in `TagP` leaves and count the leaves (not the cases) toward the
4-arm threshold.

Per arm, compile the body once using the first leg's sub-pattern
(Motoko's or-pattern typing guarantees all legs bind the same
variables). Every leaf of a case contributes one slot in the
dispatch table pointing to the same arm block, so same-body arm
merging now happens automatically for or-patterns — the emitted
Wasm is strictly smaller than the hand-expanded flat equivalent
while running the same number of instructions.

Benchmark (`test/bench/variant-switch.mo`, `instr_isWeekendOr`):
  11_050_321 → 7_070_321 (1.56×), matching `instr_isWeekend`
  exactly. Other bench rows also improved where or-patterns nest
  in the dispatch path.

Mirrors the change in both `compile_classical.ml` and
`compile_enhanced.ml`; introduces a shared `flatten_tag_leaves`
helper next to `known_tag_pat`. Effect-based handler/recognizer
split per `.claude/plans/variant-switch-br-table.md` lands as a
follow-up refactor.
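The leaf-flattening and leaf-counting logic can be sketched like this. The tuple encodings `('tag', …)` / `('alt', …)` are invented for illustration; the real `flatten_tag_leaves` works on the IR's pattern type:

```python
def flatten_tag_leaves(pat):
    """Return the list of tag hashes if `pat` is a TagP leaf or an AltP
    chain bottoming out in TagP leaves; None if the pattern doesn't
    qualify for br_table dispatch."""
    kind = pat[0]
    if kind == 'tag':                       # TagP: one leaf
        return [pat[1]]
    if kind == 'alt':                       # AltP: both legs must flatten
        left = flatten_tag_leaves(pat[1])
        right = flatten_tag_leaves(pat[2])
        if left is not None and right is not None:
            return left + right
        return None
    return None                             # any other pattern disqualifies

def qualifies(cases, threshold=4):
    """Count leaves (not cases) toward the n >= 4 threshold."""
    leaves = [flatten_tag_leaves(p) for p in cases]
    return all(l is not None for l in leaves) and sum(map(len, leaves)) >= threshold
```

Under this counting, a 2-case switch whose arms are or-patterns over 5 and 2 tags contributes 7 leaves and qualifies, matching the commit's description.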
… (enhanced)

First concrete step of the V2 handler/recognizer refactor described in
.claude/plans/variant-switch-br-table.md. Introduce a `Dispatch` module
with a `Query : int64 list list -> plan` algebraic effect (OCaml 5.3,
same machinery as ConstTrack Phase 3). The default handler runs
Gosper-based mask-finding and returns a `MaskShift { mask; shift;
table_size; slot_for_case }` plan, or `Linear` as a fallback.

The recognizer at the SwitchE variant case now collects per-case leaf
hashes (one sub-list per case, with or-pattern legs contributing
multiple entries) and asks the handler for a plan. The guard checks
`MaskShift`; the body binds the plan's fields and emits the br_table
dispatch using `slot_for_case` directly instead of rebuilding it
inline.

Behaviour-preserving: compiled Wasm is byte-identical to the previous
commit for `isWeekday` and `isWeekendOr` (same mask 0x15000, shift 12,
23-slot table, same block labels). `variant_switch.mo` passes all
phases.

Why this shape is useful even before landing further refactors:
  - Strategy logic (currently Gosper) is encapsulated in one place.
    Adding ModPrime / RotLow / Linear heuristics becomes a change in
    the handler, not in SwitchE.
  - Outer scopes (tests, debug flags, size-budget passes) can install
    their own handler to override the plan without touching the
    recognizer.
  - The protocol is token-agnostic — future LitP / AndP dispatch can
    perform the same effect with their own token types.

Scope: enhanced backend only. `compile_classical.ml` keeps the inlined
extraction for now and can be retrofitted once the architecture proves
out (per user priority: tests live under enhanced).

Known duplication: the SwitchE guard currently runs `Dispatch.Query`
a second time in the body to pattern-match the plan's fields. Threading
one plan through guard and body is a follow-up — kept simple here to
minimise the diff and make the effect protocol the sole behavioural
change.
…anced)

Previous commit queried `Dispatch.Query` twice — once in the `when`
guard (to test `MaskShift`) and once in the body (to destructure the
plan). The guard fired side-effect-free and the result was discarded.

Collapse the guard + variant arm + default arm into a single
`SwitchE (e, cs) ->` arm that computes `maybe_plan` once. `Some
MaskShift` emits br_table; `Some Linear` or `None` (any case failing
`flatten_tag_leaves`) falls through to the linear-chain emission
inlined under the same arm. Scrutinee compilation (`code1`,
`set_i`/`get_i` local) is hoisted out and shared between the two
paths.

Byte-identical Wasm to the previous commit for `isWeekday` and
`isWeekendOr` (same dispatcher: mask 0x15000, shift 12, 23-slot
table; same block labels). `variant_switch.mo` and `variants.mo`
both pass.

Net: -16 lines and only one `Dispatch.with_handler` invocation
per SwitchE node.

ggreif commented Apr 22, 2026

Reference: weekday variant hashes

Recovered from the compiled Wasm (i64.const at each #Day construction site) and cross-checked against the br_table slots in isWeekend/isWeekendOr. Useful as a concrete worked example when developing alternative Dispatch strategies.

| tag | hash (dec) | hash (hex) | & 0x15000 | br_table slot |
|---|---|---|---|---|
| #Mon | 3_853_996 | 0x3ACEAC | 0x04000 | 4 |
| #Tue | 4_203_428 | 0x4023A4 | 0x00000 | 0 |
| #Wed | 4_349_046 | 0x425C76 | 0x05000 | 5 |
| #Thu | 4_200_545 | 0x401861 | 0x01000 | 1 |
| #Fri | 3_506_557 | 0x35817D | 0x10000 | 16 |
| #Sat | 4_149_254 | 0x3F5006 | 0x15000 | 21 |
| #Sun | 4_153_708 | 0x3F616C | 0x14000 | 20 |

Gosper's mask picks the three bits at positions 12, 14, 16 (mask = 0x15000, shift = 12). Seven weekday tags land on seven distinct slot values {0, 1, 4, 5, 16, 20, 21}; slot 17 is the one valid-but-unused slot visible in the br_table's default-label entries — any future weekday whose hash happened to set bits 0 and 4 (post-shift) would land there without growing the table.

Pattern reading: bit 16 of the hash == weekend-side (set for Fri/Sat/Sun, clear for Mon/Tue/Wed/Thu); bits 12/14 distinguish days within each half.
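The table above can be re-derived mechanically from the hashes (plain Python; the hash values are the ones recovered from the Wasm):

```python
weekday = {
    'Mon': 0x3ACEAC, 'Tue': 0x4023A4, 'Wed': 0x425C76, 'Thu': 0x401861,
    'Fri': 0x35817D, 'Sat': 0x3F5006, 'Sun': 0x3F616C,
}
MASK, SHIFT = 0x15000, 12
slots = {day: (h & MASK) >> SHIFT for day, h in weekday.items()}

# The 3 mask bits sit at post-shift positions 0, 2, 4, so they encode
# 8 values; enumerating them exposes the single unused slot (17).
encodable = {a | b | c for a in (0, 1) for b in (0, 4) for c in (0, 16)}
unused = encodable - set(slots.values())
```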

Strategy notes for the dispatch handler:

  • MaskShift (current default, Gosper): 23-slot table, 2 arithmetic ops (AND + SHR_U) + br_table.
  • ModPrime could try small primes p ≥ 7; e.g. does hᵢ mod 11 give 7 distinct residues? (Quick check: 3853996 % 11 = 3, 4203428 % 11 = 9, 4349046 % 11 = 9, 4200545 % 11 = 8, 3506557 % 11 = 10, 4149254 % 11 = 10, 4153708 % 11 = 9 — collisions Tue↔Wed↔Sun and Fri↔Sat on mod 11; try 13, 17, …). Cheaper table (size p) but rem_u is ~3× an and.
  • RotLow bits=3 rot=? — would need table size 8; worth trying for comparison once the handler protocol supports it.
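The ModPrime probe can be automated. A sketch, with the caveats that the prime list and the p ≥ n requirement (so the residue table has one slot per tag) are assumptions here, and which prime the real handler would pick depends on its cost model:

```python
def first_injective_prime(hashes, primes=(7, 11, 13, 17, 19, 23, 29, 31)):
    """Smallest listed prime p, with p >= len(hashes), under which all
    hashes land on distinct residues -- i.e. h % p is a perfect 7-way
    dispatch key with a p-slot table."""
    n = len(hashes)
    for p in primes:
        if p >= n and len({h % p for h in hashes}) == n:
            return p
    return None
```

For the seven weekday hashes, 7, 11, 13 and 17 all collide; 19 is the first prime that separates them.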

startLetter: 7-way distinct dispatch (flat vs or-pattern)

A richer 7-way switch added to test/bench/variant-switch.mo. Two forms produce the same mask=0x15000, shift=12 dispatcher (same 7 weekday hashes), but partition the arm blocks differently:

startLetter (7 flat arms) — 8 arm blocks

| slot | case idx | tag | outcome |
|---|---|---|---|
| 0 | 1 | #Tue | 'T' |
| 1 | 3 | #Thu | 'T' |
| 4 | 0 | #Mon | 'M' |
| 5 | 2 | #Wed | 'W' |
| 16 | 4 | #Fri | 'F' |
| 20 | 6 | #Sun | 'S' |
| 21 | 5 | #Sat | 'S' |

The Tue and Thu arm blocks are byte-identical (both load i64.const 741070837121024 = 'T'); likewise Sat and Sun both load 732274744098816 = 'S'. Duplicated code, but each block is entered via exactly one br_table slot, so each call costs the same regardless of duplication.

startLetterOr (5 arms, 2 or-patterns) — 5 arm blocks

| slot | case idx | tag | outcome |
|---|---|---|---|
| 0 | 1 | #Tue | 'T' (case shared with #Thu) |
| 1 | 1 | #Thu | 'T' (same case) |
| 4 | 0 | #Mon | 'M' |
| 5 | 2 | #Wed | 'W' |
| 16 | 3 | #Fri | 'F' |
| 20 | 4 | #Sun | 'S' (case shared with #Sat) |
| 21 | 4 | #Sat | 'S' (same case) |

Same br_table, fewer arm blocks (two pairs collapsed).

Bench numbers (10 000 × 7-day iterations)

| metric | instructions |
|---|---|
| instr_isWeekend | 7_070_321 (incl. outer if branch cost) |
| instr_isWeekendOr | 7_070_321 |
| instr_startLetter | 6_270_321 (clean dispatch + Char sink) |
| instr_startLetterOr | 6_270_321 |

Key observation for strategy design: startLetter and startLetterOr cost exactly the same at runtime despite the or-pattern form having 2 fewer arm blocks. Same-body arm duplication is a pure code-size bloat — each slot's br_table entry jumps directly into its own block, which executes the same ≈3 Wasm instructions regardless of whether a byte-identical block lives next to it.

Where same-body merging would matter: upstream of the dispatcher. If the recognizer were to merge equivalence classes before handing the token set to the handler, N = 5 instead of N = 7. Then ModPrime could try mod 5 instead of mod 7 (half the table bytes), MaskShift would have fewer injectivity constraints (possibly a smaller mask), and perfect-hash search gets cheaper. That's the lever the plan now captures under Future Optimisation: Same-body arm merging — with the refinement that the equivalence criterion should be raw Wasm byte sequences (each arm is already a Block internally and (^^^) is difference-list concatenation of G.t), not IR structural equality.

For V2 as shipped here, or-pattern merging is the only channel; cross-case merging stays user-driven (write the or-pattern to get the size win).

ggreif and others added 6 commits April 23, 2026 01:16
…valence later

Rework the "Future Optimisation: Same-body arm merging" section to
record today's scope decision and the refinement direction.

Key points captured:

- V2 deliberately does NOT auto-merge arms with structurally-equal
  bodies across distinct cases. User-written or-patterns are the
  incentive channel — they communicate intent and are stable under
  refactoring. The recognizer's `flatten_tag_leaves` already collapses
  or-pattern legs; cross-case merging stays out of scope.

- Why merging matters upstream: same-body merging is a code-size win
  (duplicated arm blocks saved) but NOT a speedup for the
  already-dispatched case — each br_table slot still lands in its own
  block executing the same instructions. The runtime payoff is in
  *strategy choice*: the handler's search space is parameterised by N =
  distinct outcome classes, so ModPrime uses a smaller prime, MaskShift
  has fewer injectivity constraints, perfect-hash search gets cheaper.

- When cross-case merging eventually lands, the equivalence criterion
  should be raw Wasm byte sequences, not IR structural equality. Each
  arm is already a `Block` internally; `(^^^)` composition is
  difference-list concatenation of `G.t`; comparing compiled bytes
  skips IR phase-ordering noise and catches arms that incidentally
  lower to the same instructions. The `Dispatch.Query` protocol is
  already compatible (token_set is list-of-lists) — this is a
  recognizer-side extension, not a protocol change.
The existing isWeekend/isWeekendOr pair returns Bool, so an outer
`if (isWeekend d) acc1 += 1` branch muddies the per-switch cost.
Add a 7-way distinct-outcome pair that writes a Char sink directly,
giving a cleaner microbench of switch dispatch alone.

Bodies of startLetter include two natural same-body groups:
{Tue, Thu} → 'T' and {Sat, Sun} → 'S'. The -Or form collapses these
into or-patterns. Both compile to the same br_table dispatcher and
execute identical instruction counts — same-body arm blocks cost the
same regardless of whether they're physically one block (or-pattern)
or duplicated (flat). Useful datapoint when evaluating future Dispatch
strategies.

On the current branch (Gosper MaskShift, same-body merging via
or-patterns only):

  instr_isWeekend      = 7_070_321   (outer `if` adds ~800k)
  instr_isWeekendOr    = 7_070_321
  instr_startLetter    = 6_270_321   (cleaner: no outer branch)
  instr_startLetterOr  = 6_270_321
…tch_join)

Replaces the single one-shot effect
  `Query : int64 list list -> plan`
with a streaming pair:
  `Match_arm : int64 list -> unit`   (submit one case's leaves)
  `Match_join : plan`                 (join all arms into a plan)

The recognizer at SwitchE now iterates cases, performs `Match_arm
hashes` per case, then `Match_join` to receive the plan. The handler
accumulates arms in a mutable ref across the stream and commits the
Gosper-based plan at join time.

Behaviour-preserving: variant_switch.mo and variants.mo pass; the
br_table for startLetter is byte-identical to the non-streaming
version (same mask 0x15000, shift 12, 23-slot table, same labels).

Why change the shape now, before any consumer needs it? The streaming
protocol is how AND-patterns and literal-match chains will surface
decisions — subcomponents fire `Match_arm` incrementally, only the
outer context knows when to `Match_join`. Locking the protocol in now
means those future recognizers slot in without a breaking-change to
the effect type. The nested-switch case also cleanly works because
each `with_handler` scope has its own accumulator ref; state doesn't
leak between switches.

The naming `Match_join` rather than `Match_close` reflects the
semantics: the handler joins submitted arms into one dispatch
decision; it is not merely closing a stream.
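Without OCaml 5.3 effects to hand, the streaming shape can still be illustrated with a plain accumulator object (a Python stand-in: `Match_arm`/`Match_join` become methods, and one fresh instance per switch mirrors the per-`with_handler` accumulator ref, so nested switches never share state):

```python
class DispatchHandler:
    """Stand-in for the effect handler: match_arm submits one case's
    leaf hashes, match_join commits a plan over everything submitted."""
    def __init__(self):
        self.arms = []                      # per-switch accumulator

    def match_arm(self, leaf_hashes):       # ~ perform (Match_arm hashes)
        self.arms.append(list(leaf_hashes))

    def match_join(self):                   # ~ perform Match_join
        leaves = [h for arm in self.arms for h in arm]
        if len(leaves) < 4:                 # below the br_table threshold
            return ('Linear',)
        return ('MaskShift', leaves)        # real handler: Gosper search here
```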
…stration

Adds the second concrete plan variant `ModPrime { p; case_for_residue }`
to the `Dispatch` protocol, emitted as `hash rem_u p; br_table` with
a p-slot table. The handler searches primes {2, 3, 5, 7, ..., 31}
smallest-first and accepts the first that partitions the input
cleanly (all leaves of the same case share a residue AND different
cases land on different residues).

The `choose_plan` policy is intentionally ad-hoc:

  - if `n < 4` → Linear
  - else if `c < n` → ModPrime (fall back to MaskShift if no prime works)
  - else                → MaskShift

where `n` = total leaves, `c` = number of cases. `c < n` ↔ at least
one case has an or-pattern.
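The "partitions the input cleanly" acceptance test from the previous commit can be written out directly (a sketch; `cases` mirrors the token-set shape, a list of per-case leaf-hash lists, and the example numbers in the usage note are invented):

```python
def modprime_ok(cases, p):
    """True iff p partitions cleanly: every leaf of a case shares one
    residue mod p, and distinct cases land on distinct residues."""
    residues = []
    for leaves in cases:
        rs = {h % p for h in leaves}
        if len(rs) != 1:
            return False                    # leaves of one case disagree
        residues.append(rs.pop())
    return len(set(residues)) == len(cases) # no cross-case collision
```

E.g. for the made-up token set `[[10, 17], [4]]`, p = 7 passes (10 and 17 both have residue 3, 4 has residue 4), while p = 5 fails because 10 and 17 disagree mod 5.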

Why the ad-hoc split matters: it gives the handler a *visible,
measurable* reason to branch on or-pattern structure, proving the
Dispatch protocol is wired up to select different strategies for
or-patterns vs flat expansions of the same switch. Bench numbers
after this patch:

  instr_isWeekend     = 7_070_321    (MaskShift: c=n=7)
  instr_isWeekendOr   = 7_560_321    (ModPrime:  c=2, n=7)
  instr_startLetter   = 6_270_321    (MaskShift: c=n=7)
  instr_startLetterOr = 6_760_321    (ModPrime:  c=5, n=7)

ModPrime is *slower* than MaskShift per dispatch under the ICP cycle
model (`rem_u` > `and`+`shr_u`), so this policy presently makes
or-patterns run worse than their flat equivalents — the opposite of
what we ultimately want. That's by design for now: the point is to
show the protocol can differentiate. A follow-up commit will replace
the ad-hoc policy with something smarter (likely case-aware Gosper:
extend `find_variant_mask` to accept same-case hashes sharing a
slot, yielding smaller masks and a measurable or-pattern *win*).
The `Dispatch` protocol, emitters, and plan variant surface all stay
put across that refinement.