Low -- invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_invoke_linear_nf4.py:78-84 -- the new if x.numel() == x.shape[-1]: branch is gated on input shape, not on the actual failure condition the comment describes (CPU-stored, device-autocasted Params4bit on certain CUDA/bnb combos). It therefore fires on every single-vector input on all platforms, including GPU-resident weights where gemv_4bit works fine, replacing the fused single-row kernel with dequantize_4bit + F.linear.
Scenario: partially-loaded (low-VRAM) NF4 FLUX generating at the default batch size 1 -- the canonical NF4 case. t_vec/guidance_vec are (1,), timestep embedding (1,256), pooled CLIP (1,768), and every Modulation/adaLN/MLPEmbedder Linear receives a (1,dim) input where numel == shape[-1]. For FLUX dev that is ~83 Linears per forward pass that now dequantize a dense weight instead of using gemv_4bit. Doubles under CFG (separate batch-1 pos/neg passes).
Evidence: the condition mirrors bnb's gemv trigger (A.numel()==A.shape[-1] and not requires_grad), so it intercepts exactly the inputs bnb would route to gemv_4bit; quantize_model_nf4(..., modules_to_not_convert=set()) in invokeai/backend/flux/util.py makes all these embedder/modulation Linears InvokeLinearNF4; the partial-load path enables autocasting, so _autocast_forward runs.
Impact is bounded, not catastrophic: the dominant compute (attention qkv/proj, MLP over the full (1,seq_len,3072) sequence) keeps numel >> shape[-1] and stays on matmul_4bit. Peak VRAM cost is the largest single transient (the 3072->18432 modulation weight, ~113 MB bf16, freed each call and reused by the caching allocator), not the ~6.5 GB of alloc/free traffic per pass. So this is a modest, real per-step overhead on the small Linears, accepted as a correctness-for-speed trade where gemv was crashing -- but it is applied unconditionally rather than narrowed to the offloaded-weight case that actually fails.
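The shape condition and the transient size quoted above can be checked with plain arithmetic. This is only a sketch: the helper name `takes_single_vector_branch` and the `seq_len` value are illustrative assumptions, not taken from the PR, and the shapes are the ones listed in the scenario.

```python
from math import prod


def takes_single_vector_branch(shape):
    """True when numel == shape[-1], the condition quoted in the review."""
    return prod(shape) == shape[-1]


# Shapes quoted in the scenario above (batch size 1).
small_inputs = [(1,), (1, 256), (1, 768), (1, 3072)]

# Hypothetical sequence length, chosen only for illustration.
seq_len = 4096
sequence_input = (1, seq_len, 3072)

for shape in small_inputs:
    print(shape, takes_single_vector_branch(shape))
print(sequence_input, takes_single_vector_branch(sequence_input))

# Dense bf16 size of the 3072 -> 18432 modulation weight (2 bytes per element).
dense_bytes = 3072 * 18432 * 2
print(f"{dense_bytes / 1e6:.1f} MB")
```

Every batch-1 embedder/modulation input satisfies the condition, while the full-sequence input does not, which matches the claim that only the small Linears change path.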
To expose this issue, add a test that runs CustomInvokeLinearNF4._autocast_forward with a (1, in_features) CUDA input and a GPU-resident weight and asserts the gemv path is taken (e.g. that bnb.functional.dequantize_4bit is not called). To fix it, gate the dequant branch on weight-on-CPU rather than on input shape, so that GPU-resident single-vector inference keeps using gemv_4bit.
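A minimal sketch of the narrower gate suggested above, written against plain shapes and device strings so it runs without torch or bitsandbytes. The function `should_dequantize` and its parameters are hypothetical; whether "weight on CPU, input elsewhere" is the exact failure condition is an assumption drawn from the review's description, not something verified against bnb.

```python
from math import prod


def should_dequantize(input_shape, weight_device, input_device):
    """Hypothetical gate for the dequantize + linear fallback.

    Falls back only when a single-vector input (numel == shape[-1]) meets a
    weight that is stored on the CPU while the input lives on another device.
    """
    is_single_vector = prod(input_shape) == input_shape[-1]
    weight_is_offloaded = weight_device == "cpu" and input_device != "cpu"
    return is_single_vector and weight_is_offloaded


# GPU-resident weight, single-vector input: keep the fused gemv path.
print(should_dequantize((1, 3072), "cuda", "cuda"))
# CPU-stored weight autocast to CUDA, single-vector input: use the fallback.
print(should_dequantize((1, 3072), "cpu", "cuda"))
# Full-sequence input: the shape condition alone already rules it out.
print(should_dequantize((1, 4096, 3072), "cpu", "cuda"))
```

A real test along these lines would exercise the actual layer on CUDA and check that the dequantize call is not reached for the first case.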
Low -- tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py:350-355,580-619 -- the new pytest.skip("DoRA patches ... not compatible with bnb quantized layers") masks a path that was erroring, not passing, and replaces it with nothing that asserts the guard.
Scenario: on CUDA, the parametrized cases test_quantized_linear_sidecar_patches[cuda-single_dora-invoke_linear_nf4] and [...-invoke_linear_8_bit_lt]. single_dora already existed in the shared patch_under_test fixture on origin/main (added in PR 9063) and those tests had no skip guard, so the cases ran on CUDA.
Evidence: a DoRALayer is neither LoRALayer nor FluxControlLoRALayer, so in invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py it falls to unprocessed_patches; _cast_weight_bias_for_input returns a torch.empty(..., device="meta") weight (custom_invoke_linear_nf4.py:37); DoRALayer.get_parameters (invokeai/backend/patches/layers/dora_layer.py) then raises RuntimeError on the meta device. The test had only a positive allclose, no pytest.raises, so on CUDA it ERRORED. The skip silences that, and no test anywhere now asserts the guard fires -- only the skip strings reference it.
This is a coverage gap, not a production defect: DoRA genuinely cannot read bnb-quantized weights and the raise is intended. But the unsupported combination should be locked in, not erased.
To expose this issue, add a test that wraps an NF4 (and an 8-bit) custom layer, attaches a DoRALayer sidecar, runs forward(), and asserts pytest.raises(RuntimeError) -- converting the silent skip into an explicit, documented guard assertion.
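The difference between skipping and asserting the guard can be sketched with stand-in classes. Everything here is hypothetical: `FakeQuantizedLinear` and `FakeDoRAPatch` only mimic the behaviour described above (a patch that needs the original weight raises RuntimeError on a quantized layer) and are not InvokeAI classes. The sketch uses the standard library's unittest so it runs anywhere; in the repository's suite the equivalent assertion would be `pytest.raises(RuntimeError)` around the real layer's forward().

```python
import unittest


class FakeDoRAPatch:
    """Stand-in for a patch that must read the layer's original weight."""

    needs_original_weight = True


class FakeQuantizedLinear:
    """Stand-in for a bnb-quantized custom layer with sidecar patches."""

    def __init__(self):
        self.patches = []

    def add_patch(self, patch):
        self.patches.append(patch)

    def forward(self, x):
        # A quantized layer cannot hand out a dense weight, so patches that
        # need it are rejected explicitly.
        for patch in self.patches:
            if patch.needs_original_weight:
                raise RuntimeError("patch requires the unquantized weight")
        return x


class TestDoRAOnQuantizedLayer(unittest.TestCase):
    def test_forward_raises(self):
        layer = FakeQuantizedLinear()
        layer.add_patch(FakeDoRAPatch())
        # The unsupported combination is asserted, not skipped.
        with self.assertRaises(RuntimeError):
            layer.forward([1.0, 2.0])
```

Unlike a skip, this test fails if the guard ever stops firing, so the unsupported combination stays documented in the suite.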
6.13.5; Library Updates; backend: PRs that change backend files; python: PRs that change python files; python-tests: PRs that change python tests
3 participants
Summary
Fix CUDA test and runtime compatibility issues in bnb custom linear autocast paths.
gemv_4bit path for CPU-stored, device-autocasted weights by using the dequantized linear path for that shape.
ValueError from raw 8-bit CPU-weight inference.
Related Issues / Discussions
@keturn reported issues on Discord: https://discord.com/channels/1020123559063990373/1049495067846524939/1506160085108396092
QA Instructions
CUDA_VISIBLE_DEVICES="" python -m pytest \
  tests/app/util/test_torch_cuda_allocator.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_8_bit_lt.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_nf4.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_lora_patches_to_partially_loaded_model \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_model_patches_change_device \
  tests/backend/quantization/test_bnb_llm_int8.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py::test_ggml_tensor_to_device \
  tests/backend/model_manager/load/model_cache/cached_model/ \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py \
  tests/backend/patches/test_layer_patcher.py \
  tests/backend/patches/layers/test_lora_layer.py \
  tests/backend/patches/layers/test_set_parameter_layer.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py
Merge Plan
Checklist
What's New copy (if doing a release after this PR)