feat: introduce a ComputeBackend seam to abstract the forward-execution engine #446
Merged
…on engine

Add a backend boundary so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This change lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.

The new src/backend module defines a ComputeBackend trait that abstracts who executes forward(), not individual ops. The seam sits at the model-load boundary (once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile are untouched and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points so the MLX path keeps its concrete hot types (paged KV, prompt-cache detach and adopt, cache tensors) exposed. MlxBackend (src/backend/mlx.rs) implements the trait by delegating unchanged to the existing crate::loading entry points, so the same loading and forward code runs whether reached directly or through the seam.

Selection folds away under default features. Backend is an enum whose only variant is Backend::Mlx, and MlxBackend is a zero-sized type, so the enum is zero-sized with no discriminant. select_backend() always returns that one variant with no environment read and no branch, and every Backend method is a single-arm match marked #[inline]. After inlining, select_backend().load_model(p) lowers to a direct call to the existing MLX loader, producing codegen identical to the pre-seam build.

The optional non-MLX path lives behind the default-off experimental-backend Cargo feature: the experimental module and the Backend::Experimental variant are cfg-gated, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code, and the feature-on select_backend() (the only place that reads an env switch) is compiled only when the feature is enabled.
Behavior is preserved by construction: the control-plane load call sites (CLI generate and chat, both server model-worker loops, and the offline draft-model load) now call select_backend().load_model* instead of crate::load_model*, and those methods delegate to the unchanged loaders. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available. Focused tests assert that selection resolves to MLX under default features, that the seam reaches the real MLX loader (it errors on a missing directory rather than hitting a backend shim), and, at compile time, that MlxBackend implements ComputeBackend and LoadedModel implements LanguageModel.
This was referenced Jun 26, 2026
Summary
Introduce a ComputeBackend seam so a future non-MLX execution engine can host forward() without routing through the MLX bridge. The motivating target is FuriosaAI TCP / RNGD, whose furiosa-opt toolchain compiles to a virtual ISA and cannot use MLX at all. This PR lands the seam only and is backend-neutral; it implements no Furiosa or TCP kernels.
Seam design
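As a rough picture of the seam described in the paragraph below, here is a minimal Rust sketch of a trait limited to the three load entry points. `LoadedModel`, `LoadError`, `FakeBackend`, `load_with`, and the `adapter` / `world_size` parameters are hypothetical stand-ins; the PR names the entry points but does not show their real signatures.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's loaded-model type. In the PR the
/// real LoadedModel implements LanguageModel and keeps MLX state concrete.
#[derive(Debug)]
pub struct LoadedModel {
    pub name: String,
}

/// Hypothetical error type; the crate's real error type is not shown.
#[derive(Debug)]
pub struct LoadError(pub String);

/// The seam: only the load entry points, no per-op or per-token methods,
/// and a concrete return type rather than a Box<dyn ...>.
pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, LoadError>;
    fn load_model_with_adapter(&self, path: &Path, adapter: &Path)
        -> Result<LoadedModel, LoadError>;
    fn load_model_with_tensor_parallel(&self, path: &Path, world_size: usize)
        -> Result<LoadedModel, LoadError>;
}

/// Toy backend used only to exercise the trait shape.
pub struct FakeBackend;

impl ComputeBackend for FakeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, LoadError> {
        // "Load" by taking the final path component as the model name.
        path.file_name()
            .and_then(|n| n.to_str())
            .map(|n| LoadedModel { name: n.to_string() })
            .ok_or_else(|| LoadError(format!("invalid model path: {}", path.display())))
    }

    fn load_model_with_adapter(&self, path: &Path, _adapter: &Path)
        -> Result<LoadedModel, LoadError> {
        self.load_model(path)
    }

    fn load_model_with_tensor_parallel(&self, path: &Path, _world_size: usize)
        -> Result<LoadedModel, LoadError> {
        self.load_model(path)
    }
}

/// Generic over the backend, so the call is resolved at compile time and
/// happens once per load; the per-token forward() path never touches it.
pub fn load_with<B: ComputeBackend>(backend: &B, path: &Path) -> Result<LoadedModel, LoadError> {
    backend.load_model(path)
}
```

The generic helper is only illustrative; in the PR, selection goes through the `Backend` enum.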
ComputeBackend (new src/backend/mod.rs) abstracts who executes forward(), not individual ops. It sits at the model-load boundary (called once per load), not at the per-token forward() call, so MLX graph fusion and mx.compile stay intact and no indirection enters the inner loop. The trait is drawn narrowly to the load entry points (load_model, load_model_with_adapter, load_model_with_tensor_parallel) so the MLX path keeps its concrete hot types exposed: paged KV, prompt-cache detach and adopt, and cache tensors are never type-erased behind Box<dyn>. A non-MLX engine implements the same LanguageModel forward contract behind a different backend.
What moved behind MlxBackend
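A minimal sketch of the delegation described below, assuming a stand-in `loading` module in place of the real `crate::loading` (its `LoadedModel` type and `String` error are hypothetical). The point is that `MlxBackend` has no fields and its method body is a single call into the existing loader.

```rust
use std::path::Path;

/// Hypothetical stand-in for the existing crate::loading module.
pub mod loading {
    use std::path::Path;

    #[derive(Debug)]
    pub struct LoadedModel {
        pub dir: String,
    }

    pub fn load_model(path: &Path) -> Result<LoadedModel, String> {
        if path.is_dir() {
            Ok(LoadedModel { dir: path.display().to_string() })
        } else {
            Err(format!("model directory not found: {}", path.display()))
        }
    }
}

pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<loading::LoadedModel, String>;
}

/// Zero-sized: no fields, so it occupies no memory and costs nothing to pass.
#[derive(Debug, Clone, Copy, Default)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    #[inline]
    fn load_model(&self, path: &Path) -> Result<loading::LoadedModel, String> {
        // Pure delegation: the existing loader runs unchanged.
        loading::load_model(path)
    }
}
```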
MlxBackend (src/backend/mlx.rs) is a zero-sized type that implements ComputeBackend by delegating unchanged to the existing crate::loading entry points. No loading logic is reimplemented; the same code runs whether reached directly or through the seam. The public load_model / load_model_with_adapter / load_model_with_tensor_parallel functions remain available (tests and bench harnesses still use them).
Feature flag
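The gating described below can be sketched as follows. `ExperimentalBackend` and the accepted value `"experimental"` for `MLXCEL_BACKEND` are hypothetical; the PR names the env switch and the `Backend::Experimental` variant but not the exact values or types. In a default build only the first `select_backend()` exists.

```rust
/// Always compiled: the zero-sized MLX backend.
#[derive(Debug, Clone, Copy)]
pub struct MlxBackend;

/// Scaffold slot, compiled only with --features experimental-backend.
#[cfg(feature = "experimental-backend")]
#[derive(Debug, Clone, Copy)]
pub struct ExperimentalBackend;

#[cfg(feature = "experimental-backend")]
impl ExperimentalBackend {
    /// Reports "not implemented" instead of pretending to load.
    pub fn load_model(&self, path: &std::path::Path) -> Result<(), String> {
        Err(format!(
            "experimental backend: not implemented (requested {})",
            path.display()
        ))
    }
}

#[derive(Debug, Clone, Copy)]
pub enum Backend {
    Mlx(MlxBackend),
    /// This variant does not exist in default builds.
    #[cfg(feature = "experimental-backend")]
    Experimental(ExperimentalBackend),
}

/// Default build: a constant constructor with no env read and no branch.
#[cfg(not(feature = "experimental-backend"))]
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}

/// Feature-on build: the only code that reads the runtime switch.
/// The accepted value "experimental" is an assumption for illustration.
#[cfg(feature = "experimental-backend")]
#[must_use]
pub fn select_backend() -> Backend {
    match std::env::var("MLXCEL_BACKEND").as_deref() {
        Ok("experimental") => Backend::Experimental(ExperimentalBackend),
        _ => Backend::Mlx(MlxBackend),
    }
}
```

Building with `--features experimental-backend` swaps in the second `select_backend()` and adds the extra variant; without it, cfg strips both before type-checking, so neither reaches codegen.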
A default-off experimental-backend Cargo feature gates the optional non-MLX path. The src/backend/experimental.rs scaffold module and the Backend::Experimental enum variant are cfg-gated behind it, so shipping binaries (Apple Silicon, CUDA) compile no extra backend code. The scaffold reports "not implemented" rather than pretending to load; a real engine (and any hardware-feasibility gate) is future work. The feature-on select_backend() is the only place that reads a runtime backend switch (MLXCEL_BACKEND), and it is compiled only when the feature is enabled.
How codegen equivalence / no dispatch is guaranteed when the feature is off
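The zero-size and single-arm-match argument below can be checked directly with `std::mem::size_of`. This sketch assumes a stub loader body (hypothetical) and shows only `load_model`; the real enum has one method per load entry point.

```rust
use std::path::Path;

/// Hypothetical stand-in for the loaded model.
#[derive(Debug, PartialEq)]
pub struct LoadedModel {
    pub source: String,
}

pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
}

/// Zero-sized MLX backend.
#[derive(Debug, Clone, Copy)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    #[inline]
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        // Stub body standing in for the existing MLX loader.
        Ok(LoadedModel { source: format!("mlx:{}", path.display()) })
    }
}

/// Default-features shape: one variant wrapping a zero-sized type, so the
/// enum itself is zero-sized and carries no discriminant.
#[derive(Debug, Clone, Copy)]
pub enum Backend {
    Mlx(MlxBackend),
}

impl Backend {
    #[inline]
    pub fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        // Single-arm match: nothing to branch on, so after inlining this is
        // just the call to MlxBackend::load_model.
        match self {
            Backend::Mlx(b) => b.load_model(path),
        }
    }
}

/// Constant constructor: no environment read, no branch.
#[inline]
#[must_use]
pub fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}
```

`size_of::<Backend>() == 0` confirms there is no discriminant to branch on. Whether the optimizer actually emits a direct call is a property of the release build and would be confirmed by inspecting its assembly, not by a unit test.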
Backend is an enum whose only variant under default features is Backend::Mlx, and MlxBackend is zero-sized, so Backend is itself zero-sized with no discriminant. select_backend() (#[inline], #[must_use]) is a constant constructor that always returns that one variant with no environment read and no branch. Every Backend method is a single-arm match marked #[inline]. After inlining (release builds use fat LTO and codegen-units = 1), select_backend().load_model(p) lowers to a direct call to the existing MLX loader, identical to the pre-seam build. The env-reading selection path and the second enum variant exist only under #[cfg(feature = "experimental-backend")], so they are absent from default codegen entirely.
How behavior preservation is ensured
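A sketch of what a rerouted call site looks like, assuming hypothetical `LoadArgs` fields and a `StubBackend` that records which entry point was hit. The tensor-parallel-before-adapter precedence is an assumption; the PR only says each site covers tensor-parallel / adapter / plain. The branch logic stays as it was; only the receiver changes from the free `crate::load_model*` functions to `select_backend()`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical CLI/worker options; the real argument types are not shown.
pub struct LoadArgs {
    pub model: PathBuf,
    pub adapter: Option<PathBuf>,
    pub tensor_parallel: Option<usize>,
}

/// Which seam entry point was used (stands in for a loaded model).
#[derive(Debug, PartialEq)]
pub enum EntryPoint {
    TensorParallel(usize),
    Adapter,
    Plain,
}

/// Stub backend that only records the entry point it was called through.
#[derive(Debug, Clone, Copy)]
pub struct StubBackend;

impl StubBackend {
    pub fn load_model(&self, _path: &Path) -> EntryPoint {
        EntryPoint::Plain
    }
    pub fn load_model_with_adapter(&self, _path: &Path, _adapter: &Path) -> EntryPoint {
        EntryPoint::Adapter
    }
    pub fn load_model_with_tensor_parallel(&self, _path: &Path, world_size: usize) -> EntryPoint {
        EntryPoint::TensorParallel(world_size)
    }
}

pub fn select_backend() -> StubBackend {
    StubBackend
}

/// A rerouted call site: same three-way branch as before, but every load
/// now goes through select_backend() instead of a free crate::load_model* call.
pub fn load_for_command(args: &LoadArgs) -> EntryPoint {
    let backend = select_backend();
    if let Some(world_size) = args.tensor_parallel {
        backend.load_model_with_tensor_parallel(&args.model, world_size)
    } else if let Some(adapter) = &args.adapter {
        backend.load_model_with_adapter(&args.model, adapter)
    } else {
        backend.load_model(&args.model)
    }
}
```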
Behavior is preserved by construction: the control-plane load call sites now go through the seam, and the seam methods delegate to the unchanged loaders, so the same loading and forward code runs. Rerouted call sites:
- src/commands/generate.rs (primary load: tensor-parallel / adapter / plain, plus the offline draft-model load)
- src/commands/chat.rs (REPL load)
- src/server/model_worker.rs (both the batched and the legacy sequential worker loops, each covering tensor-parallel / adapter / plain)
The pipeline-parallel branch keeps its own distributed loader and does not go through the seam.
What changed
- src/backend/mod.rs (new): ComputeBackend trait, Backend enum, select_backend().
- src/backend/mlx.rs (new): MlxBackend, delegates to crate::loading.
- src/backend/experimental.rs (new, cfg-gated): scaffold plug-in slot for a non-MLX engine.
- src/backend/tests.rs (new): scoped seam tests.
- src/lib.rs: pub mod backend; and re-export of Backend, ComputeBackend, MlxBackend, select_backend.
- Cargo.toml: default-off experimental-backend feature.
- src/commands/generate.rs, src/commands/chat.rs, src/server/model_worker.rs: route loads through select_backend().
Test plan
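The three seam tests described in the commit message can be sketched as below. The runtime checks are written as plain functions returning `bool` so the sketch runs standalone; in the crate they would be `#[test]` functions in src/backend/tests.rs. The stand-in `LanguageModel`, `LoadedModel`, and loader body are hypothetical.

```rust
use std::path::Path;

/// Hypothetical stand-ins for the crate's model trait and type.
pub trait LanguageModel {}
pub struct LoadedModel;
impl LanguageModel for LoadedModel {}

pub trait ComputeBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String>;
}

#[derive(Debug, Clone, Copy)]
pub struct MlxBackend;

impl ComputeBackend for MlxBackend {
    fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        // Stub loader: like the real one, it fails on a missing directory.
        if path.is_dir() {
            Ok(LoadedModel)
        } else {
            Err(format!("model directory not found: {}", path.display()))
        }
    }
}

#[derive(Debug, Clone, Copy)]
pub enum Backend {
    Mlx(MlxBackend),
}

impl Backend {
    pub fn load_model(&self, path: &Path) -> Result<LoadedModel, String> {
        match self {
            Backend::Mlx(b) => b.load_model(path),
        }
    }
}

pub fn select_backend() -> Backend {
    Backend::Mlx(MlxBackend)
}

// Compile-time check: this closure is never called, but it only
// type-checks while both trait bounds hold.
const _: fn() = || {
    fn assert_compute_backend<T: ComputeBackend>() {}
    fn assert_language_model<T: LanguageModel>() {}
    assert_compute_backend::<MlxBackend>();
    assert_language_model::<LoadedModel>();
};

/// Check 1: default features resolve to the MLX backend.
pub fn selection_resolves_to_mlx() -> bool {
    matches!(select_backend(), Backend::Mlx(_))
}

/// Check 2: the seam reaches the loader, which errors on a missing
/// directory instead of returning a placeholder.
pub fn missing_dir_errors_through_seam() -> bool {
    select_backend()
        .load_model(Path::new("/nonexistent/model/dir"))
        .is_err()
}
```

The `const _: fn() = || { ... };` block fails to compile if either trait bound stops holding, which is all a compile-time assertion needs.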
- cargo check --lib --features metal,accelerate
- cargo check --bins --features metal,accelerate
- cargo check --lib --features metal,accelerate,experimental-backend (gated module compiles)
- cargo clippy --lib --tests --features metal,accelerate -- -D warnings
- cargo clippy --lib --features metal,accelerate,experimental-backend -- -D warnings
- cargo test --lib backend:: --features metal,accelerate (3 tests pass)
- cargo fmt --check
Closes #338