docs: ADR 0004 compute-backend session seam + StableHLO/MLIR family direction #447
Open
inureyes wants to merge 2 commits into
…MLIR family direction
…allel tracks, export-first model definition, quantized success bar)
Proposed ADR (status: Proposed, awaiting maintainer acceptance) that reframes the compute-backend seam direction, drafted per the discussion following #338 / #446.
Why
The refactor's purpose is to host FuriosaAI (TCP/RNGD), Tenstorrent, and an OpenXLA-based path. The seam shipped in #446 draws the boundary at model load and returns the concrete MLX LoadedModel, which is insufficient: LanguageModel::forward is itself MLX-coupled (it takes &mut [KVCache] and returns UniquePtr<MlxArray>), and all three targets are graph-compiler backends, not eager-op, so neither a load factory nor an op-level trait fits.
What the ADR decides
Draw the seam at the inference-session / engine level with a token-level contract (prefill + decode-step, on-device sampling, backend-owned KV, returns token ids plus optional logprobs), keeping the MLX hot path internal so fusion, mx.compile, paged KV, and prompt-cache detach/adopt are preserved. CxxGenerator becomes the MLX session implementation.
Serve the non-MLX targets with one StableHLO/MLIR compiler-family backend (OpenXLA + Tenstorrent TT-MLIR, plus Furiosa if its compiler ingests StableHLO) rather than per-vendor engines, collapsing the execution families to two: MLX eager and StableHLO-compiler. MLX stays the full-featured reference backend.
The select_backend selection skeleton and the default-off experimental-backend feature gate from #446 are kept; the ComputeBackend trait contract from #446 is marked provisional and will be superseded by the session contract before any non-MLX engine lands.
Open follow-ups named (not resolved here)
Compiler-family model-definition strategy (StableHLO emission), the KV / paged-KV / scheduler coupling to the MLX KVCache type (MLX-only at first, abstracted later), and the open question of whether Furiosa's compiler can ingest StableHLO.
Validation plan
Prove the session contract with OpenXLA as a second reference backend on one or two hot models before the contract is locked and this ADR moves to Accepted.
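As a rough illustration of what "proving the session contract with a second backend" might involve, here is a minimal Rust sketch of a token-level session trait plus a backend-agnostic check. Every name in it (InferenceSession, Sampled, SessionError, CountingSession, generate, agree) is hypothetical and does not come from the ADR or the repository. The toy backend stands in for a real engine, and its "KV state" is just a counter.

```rust
// Hypothetical sketch only: these names are illustrative assumptions,
// not the contract defined by ADR 0004.

/// Token id type; `u32` is an assumption.
pub type TokenId = u32;

/// One sampled token plus an optional log-probability.
#[derive(Debug, Clone, PartialEq)]
pub struct Sampled {
    pub token: TokenId,
    pub logprob: Option<f32>,
}

#[derive(Debug, PartialEq)]
pub enum SessionError {
    EmptyPrompt,
    NotPrefilled,
}

/// Token-level seam: the backend owns its KV state and does its own
/// sampling, so only token ids (and optional logprobs) cross the boundary.
pub trait InferenceSession {
    /// Ingest the whole prompt and return the first sampled token.
    fn prefill(&mut self, prompt: &[TokenId]) -> Result<Sampled, SessionError>;
    /// Feed the previously sampled token and return the next one.
    fn decode_step(&mut self, last: TokenId) -> Result<Sampled, SessionError>;
}

/// Toy stand-in backend: the "KV cache" is just a count of cached positions.
pub struct CountingSession {
    cached: usize,
    vocab: u32,
    with_logprobs: bool,
}

impl CountingSession {
    pub fn new(vocab: u32, with_logprobs: bool) -> Self {
        // Clamp to 1 so the modulo in `sample` can never divide by zero.
        Self { cached: 0, vocab: vocab.max(1), with_logprobs }
    }

    /// Deterministic "sampling": next token is (last + 1) mod vocab.
    fn sample(&self, last: TokenId) -> Sampled {
        Sampled {
            token: last.wrapping_add(1) % self.vocab,
            logprob: if self.with_logprobs { Some(0.0) } else { None },
        }
    }
}

impl InferenceSession for CountingSession {
    fn prefill(&mut self, prompt: &[TokenId]) -> Result<Sampled, SessionError> {
        let last = *prompt.last().ok_or(SessionError::EmptyPrompt)?;
        self.cached = prompt.len();
        Ok(self.sample(last))
    }

    fn decode_step(&mut self, last: TokenId) -> Result<Sampled, SessionError> {
        if self.cached == 0 {
            return Err(SessionError::NotPrefilled);
        }
        self.cached += 1;
        Ok(self.sample(last))
    }
}

/// Backend-agnostic generation loop written only against the trait.
pub fn generate<S: InferenceSession>(
    session: &mut S,
    prompt: &[TokenId],
    max_new: usize,
) -> Result<Vec<TokenId>, SessionError> {
    let mut out = Vec::with_capacity(max_new);
    if max_new == 0 {
        return Ok(out);
    }
    let mut cur = session.prefill(prompt)?;
    out.push(cur.token);
    while out.len() < max_new {
        cur = session.decode_step(cur.token)?;
        out.push(cur.token);
    }
    Ok(out)
}

/// Conformance-style check: two sessions must emit the same token ids.
pub fn agree<A: InferenceSession, B: InferenceSession>(
    a: &mut A,
    b: &mut B,
    prompt: &[TokenId],
    max_new: usize,
) -> bool {
    match (generate(a, prompt, max_new), generate(b, prompt, max_new)) {
        (Ok(x), Ok(y)) => x == y,
        _ => false,
    }
}
```

In this sketch the caller never sees a KV cache or an array type, which is the property the ADR wants from the seam. A real validation would replace CountingSession with the MLX session and an OpenXLA-backed session, and would presumably compare outputs under fixed sampling settings rather than with a toy rule.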
This PR is documentation only and should not be merged until the direction is accepted. References #338 and PR #446.