feat: introduce a ComputeBackend seam to abstract the forward-execution engine #446

Merged: inureyes merged 2 commits into main from feat/issue-338-compute-backend-seam on Jun 25, 2026
@inureyes

Member

Summary

Introduce a ComputeBackend seam so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This PR lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

Seam design

ComputeBackend (new src/backend/mod.rs) abstracts who executes forward(), not individual ops. It sits at the model-load boundary (called once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile stay intact and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points (load_model, load_model_with_adapter, load_model_with_tensor_parallel) so the MLX path keeps its concrete hot types exposed: paged KV, prompt-cache detach and adopt, and cache tensors are never type-erased behind Box<dyn>. A non-MLX engine implements the same LanguageModel forward contract behind a different backend.
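A minimal sketch of where the seam sits, assuming simplified signatures: the PR does not show the real method signatures, so the `Result<LoadedModel, String>` return type, the `world_size` parameter, the `LoadedModel` fields, and `ToyBackend` are all hypothetical stand-ins. The point is that backend dispatch happens once at load, after which the decode loop calls `forward` on a concrete type.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's forward contract.
pub trait LanguageModel {
    /// Returns one logit per vocabulary entry for the next token.
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Hypothetical concrete model type. The real one would keep paged KV and
/// prompt-cache state as concrete fields rather than behind `Box<dyn>`.
pub struct LoadedModel {
    vocab_size: usize,
}

impl LanguageModel for LoadedModel {
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        // Dummy logits: put all mass on the id equal to the sequence length.
        let mut logits = vec![0.0; self.vocab_size];
        logits[tokens.len() % self.vocab_size] = 1.0;
        logits
    }
}

/// The seam: only load entry points, each called once per model load.
pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
    fn load_model_with_adapter(&self, path: &Path, adapter: &Path) -> Result<LoadedModel, String>;
    fn load_model_with_tensor_parallel(
        &self,
        path: &Path,
        world_size: usize,
    ) -> Result<LoadedModel, String>;
}

/// Toy backend used only to exercise the trait in this sketch.
pub struct ToyBackend;

impl ComputeBackend for ToyBackend {
    fn load_model(&self, _path: &Path) -> Result<LoadedModel, String> {
        Ok(LoadedModel { vocab_size: 8 })
    }
    fn load_model_with_adapter(&self, path: &Path, _adapter: &Path) -> Result<LoadedModel, String> {
        self.load_model(path)
    }
    fn load_model_with_tensor_parallel(
        &self,
        path: &Path,
        _world_size: usize,
    ) -> Result<LoadedModel, String> {
        self.load_model(path)
    }
}

/// Backend dispatch happens once, at load. The loop below then calls
/// `forward` on the concrete `LoadedModel`, with no trait object involved.
pub fn greedy_decode<B: ComputeBackend>(
    backend: &B,
    path: &Path,
    steps: usize,
) -> Result<Vec<u32>, String> {
    let mut model = backend.load_model(path)?;
    let mut tokens = vec![0u32];
    for _ in 0..steps {
        let logits = model.forward(&tokens);
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap_or(0);
        tokens.push(next);
    }
    Ok(tokens)
}
```

Because the trait returns a concrete type, anything the hot path needs (cache handles, tensors) stays statically typed after the load call returns.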

What moved behind MlxBackend

MlxBackend (src/backend/mlx.rs) is a zero-sized type that implements ComputeBackend by delegating unchanged to the existing crate::loading entry points. No loading logic is reimplemented; the same code runs whether reached directly or through the seam. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available (tests and bench harnesses still use them).
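The delegation described above might look roughly like the following. This is a sketch under assumptions: the `loading` module here is a local stand-in for `crate::loading`, and `LoadedModel`, the `String` error type, and the missing-directory check are hypothetical. Only one of the three entry points is shown.

```rust
use std::path::Path;

/// Hypothetical stand-in for the loaded-model type.
#[derive(Debug, PartialEq)]
pub struct LoadedModel {
    pub source: String,
}

/// Local stand-in for the existing `crate::loading` entry points.
pub mod loading {
    use super::LoadedModel;
    use std::path::Path;

    pub fn load_model(path: &Path) -> Result<LoadedModel, String> {
        if !path.is_dir() {
            return Err(format!("model directory not found: {}", path.display()));
        }
        Ok(LoadedModel { source: path.display().to_string() })
    }
}

/// Reduced to one method for brevity.
pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
}

/// Zero-sized: no fields, so a value carries no runtime state.
#[derive(Clone, Copy, Debug, Default)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    #[inline]
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        // Pure delegation: no loading logic lives in the backend itself.
        loading::load_model(path)
    }
}
```

Calling `MlxBackend.load_model(p)` and `loading::load_model(p)` yields the same result for the same path, which is the property the PR relies on when it keeps the public loader functions available.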

Feature flag

A default-off experimental-backend Cargo feature gates the optional non-MLX path. The src/backend/experimental.rs scaffold module and the Backend::Experimental enum variant are cfg-gated behind it, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code. The scaffold reports "not implemented" rather than pretending to load; a real engine (and any hardware-feasibility gate) is future work. The feature-on select_backend() is the only place that reads a runtime backend switch (MLXCEL_BACKEND), and it is compiled only when the feature is enabled.
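As an illustration of the gating, here is a reduced sketch with unit variants. The feature name and the `MLXCEL_BACKEND` variable come from the PR; the accepted value `"experimental"`, the string-based `load_model` signature, and the exact error text are assumptions. Built without the feature, only the first `select_backend` and the `Mlx` arm exist.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Mlx,
    /// Exists only when the feature is enabled.
    #[cfg(feature = "experimental-backend")]
    Experimental,
}

/// Default build: a constant result, with no environment read.
#[cfg(not(feature = "experimental-backend"))]
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx
}

/// Feature-on build: the only place that reads the runtime switch.
#[cfg(feature = "experimental-backend")]
#[must_use]
pub fn select_backend() -> Backend {
    // The accepted value "experimental" is an assumption of this sketch.
    match std::env::var("MLXCEL_BACKEND").as_deref() {
        Ok("experimental") => Backend::Experimental,
        _ => Backend::Mlx,
    }
}

impl Backend {
    /// Simplified signature: returns a description instead of a model.
    pub fn load_model(&self, path: &str) -> Result<String, String> {
        match self {
            Backend::Mlx => Ok(format!("mlx:{path}")),
            // The scaffold refuses to load rather than pretending to.
            #[cfg(feature = "experimental-backend")]
            Backend::Experimental => Err("experimental backend: not implemented".to_string()),
        }
    }
}
```

Outside Cargo, recent compilers may warn about the unknown `feature` cfg name; within the crate, declaring `experimental-backend` in Cargo.toml makes the name known.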

How codegen equivalence / no dispatch is guaranteed when the feature is off

Backend is an enum whose only variant under default features is Backend::Mlx, and MlxBackend is zero-sized, so Backend is itself zero-sized with no discriminant. select_backend() (#[inline], #[must_use]) is a constant constructor that always returns that one variant with no environment read and no branch. Every Backend method is a single-arm match marked #[inline]. After inlining (release builds use fat LTO and codegen-units = 1), select_backend().load_model(p) lowers to a direct call to the existing MLX loader, identical to the pre-seam build. The env-reading selection path and the second enum variant only exist under #[cfg(feature = "experimental-backend")], so they are absent from default codegen entirely.
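The zero-size claim can be checked directly. The sketch below assumes a simplified `load_model` signature and a hypothetical `mlx_loader` stand-in for the existing loader; making `select_backend` a `const fn` is a choice of this example, not something the PR states.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the existing MLX loader.
pub fn mlx_loader(path: &str) -> Result<String, String> {
    Ok(format!("loaded {path}"))
}

/// Zero-sized backend handle.
#[derive(Clone, Copy)]
pub struct MlxBackend;

impl MlxBackend {
    #[inline]
    pub fn load_model(&self, path: &str) -> Result<String, String> {
        mlx_loader(path)
    }
}

/// One variant with a zero-sized payload: no discriminant is stored.
#[derive(Clone, Copy)]
pub enum Backend {
    Mlx(MlxBackend),
}

/// Constant constructor: no environment read, no branch.
#[inline]
#[must_use]
pub const fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}

impl Backend {
    /// Single-arm match: nothing to dispatch on once inlined.
    #[inline]
    pub fn load_model(&self, path: &str) -> Result<String, String> {
        match self {
            Backend::Mlx(backend) => backend.load_model(path),
        }
    }
}

// Compile-time check that the selector occupies no storage.
const _: () = assert!(size_of::<Backend>() == 0);
```

The size check only shows that the value carries no data. Whether the final machine code matches the pre-seam build still depends on the optimizer inlining the calls, which is why the PR points to the release profile settings.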

How behavior preservation is ensured

Behavior is preserved by construction: the control-plane load call sites now go through the seam, and the seam methods delegate to the unchanged loaders, so the same loading and forward code runs. Rerouted call sites: src/commands/generate.rs (primary load: tensor-parallel / adapter / plain, plus the offline draft-model load), src/commands/chat.rs (REPL load), and src/server/model_worker.rs (both the batched and the legacy sequential worker loops, each covering tensor-parallel / adapter / plain). The pipeline-parallel branch keeps its own distributed loader and does not go through the seam.
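A rerouted call site could be pictured as follows. This is a hypothetical shape only: the `LoadRequest` enum, the `load_via_seam` helper, and the string return values are invented for illustration, and the real call sites in generate.rs, chat.rs, and model_worker.rs are not shown in this PR description.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical description of what a call site was asked to load.
pub enum LoadRequest {
    TensorParallel { world_size: usize },
    Adapter { adapter: PathBuf },
    Plain,
}

/// Stand-in backend whose methods report which loader they reached.
pub struct Backend;

impl Backend {
    pub fn load_model(&self, model: &Path) -> String {
        format!("plain:{}", model.display())
    }
    pub fn load_model_with_adapter(&self, model: &Path, adapter: &Path) -> String {
        format!("adapter:{}+{}", model.display(), adapter.display())
    }
    pub fn load_model_with_tensor_parallel(&self, model: &Path, world_size: usize) -> String {
        format!("tp{}:{}", world_size, model.display())
    }
}

pub fn select_backend() -> Backend {
    Backend
}

/// The three-way branch stays at the call site; only the callee changes
/// from a free loader function to a method on the selected backend.
pub fn load_via_seam(model: &Path, request: &LoadRequest) -> String {
    let backend = select_backend();
    match request {
        LoadRequest::TensorParallel { world_size } => {
            backend.load_model_with_tensor_parallel(model, *world_size)
        }
        LoadRequest::Adapter { adapter } => backend.load_model_with_adapter(model, adapter),
        LoadRequest::Plain => backend.load_model(model),
    }
}
```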

What changed

  • src/backend/mod.rs (new): ComputeBackend trait, Backend enum, select_backend().
  • src/backend/mlx.rs (new): MlxBackend, delegates to crate::loading.
  • src/backend/experimental.rs (new, cfg-gated): scaffold plug-in slot for a non-MLX engine.
  • src/backend/tests.rs (new): scoped seam tests.
  • src/lib.rs: pub mod backend; and re-export of Backend, ComputeBackend, MlxBackend, select_backend.
  • Cargo.toml: default-off experimental-backend feature.
  • src/commands/generate.rs, src/commands/chat.rs, src/server/model_worker.rs: route loads through select_backend().

Test plan

  • cargo check --lib --features metal,accelerate
  • cargo check --bins --features metal,accelerate
  • cargo check --lib --features metal,accelerate,experimental-backend (gated module compiles)
  • cargo clippy --lib --tests --features metal,accelerate -- -D warnings
  • cargo clippy --lib --features metal,accelerate,experimental-backend -- -D warnings
  • cargo test --lib backend:: --features metal,accelerate (3 tests pass)
  • cargo fmt --check
  • Real-checkpoint temp-0 byte-identical token parity and throughput on the MLX path are owned by the maintainer's release-build parity gate.

Closes #338

…on engine

Add a backend boundary so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This change lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

The new src/backend module defines a ComputeBackend trait that abstracts who executes forward(), not individual ops. The seam sits at the model-load boundary (once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile are untouched and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points so the MLX path keeps its concrete hot types (paged KV, prompt-cache detach and adopt, cache tensors) exposed. MlxBackend (src/backend/mlx.rs) implements the trait by delegating unchanged to the existing crate::loading entry points, so the same loading and forward code runs whether reached directly or through the seam.

Selection folds away under default features. Backend is an enum whose only variant is Backend::Mlx, and MlxBackend is a zero-sized type, so the enum is zero-sized with no discriminant. select_backend() always returns that one variant with no environment read and no branch, and every Backend method is a single-arm match marked #[inline]. After inlining, select_backend().load_model(p) lowers to a direct call to the existing MLX loader, with codegen identical to the pre-seam build. The optional non-MLX path lives behind the default-off experimental-backend Cargo feature: the experimental module and the Backend::Experimental variant are cfg-gated, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code, and the feature-on select_backend() (the only place that reads an env switch) is compiled only when the feature is enabled.

Behavior is preserved by construction: the control-plane load call sites (CLI generate and chat, both server model-worker loops, and the offline draft-model load) now call select_backend().load_model* instead of crate::load_model*, and those methods delegate to the unchanged loaders. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available. Focused tests assert that selection resolves to MLX under default features, that the seam reaches the real MLX loader (a missing directory produces the loader's own error rather than one from a backend shim), and, at compile time, that MlxBackend implements ComputeBackend and LoadedModel implements LanguageModel.
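The compile-time part of those tests can be expressed with generic functions whose bounds do the checking. The sketch below uses empty stand-in traits and types, and the helper names are hypothetical; the actual contents of src/backend/tests.rs are not shown here.

```rust
// Empty stand-ins for the crate's traits and types.
pub trait ComputeBackend {}
pub trait LanguageModel {}

pub struct MlxBackend;
pub struct LoadedModel;

impl ComputeBackend for MlxBackend {}
impl LanguageModel for LoadedModel {}

// These functions have no body to run; the trait bound is the assertion.
fn assert_compute_backend<T: ComputeBackend>() {}
fn assert_language_model<T: LanguageModel>() {}

/// Compiles only if both impls exist. Calling it does nothing at run time.
pub fn trait_bounds_hold() {
    assert_compute_backend::<MlxBackend>();
    assert_language_model::<LoadedModel>();
}
```

If either impl were removed, the crate would fail to build instead of failing a test at run time, so the check needs no model checkpoint and no hardware.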
@inureyes added the type:enhancement, priority:medium, area:architecture, and status:review labels on Jun 25, 2026

Labels

  • area:architecture (Architecture and code structure changes)
  • priority:medium (Medium priority)
  • status:done (Completed)
  • type:enhancement (New features, capabilities, or significant additions)
