Fix python tests on CUDA by JPPhoto · Pull Request #9215 · invoke-ai/InvokeAI

JPPhoto · 2026-05-19T18:17:36Z

Summary

Fix CUDA test and runtime compatibility issues in bnb custom linear autocast paths.

Avoid the bitsandbytes NF4 single-vector gemv_4bit path for CPU-stored, device-autocasted weights by using the dequantized linear path for that shape.
Update CUDA tests to account for current bitsandbytes/PyTorch failure modes, including ValueError from raw 8-bit CPU-weight inference.
Avoid unsafe raw NF4 CPU-weight probes in tests and skip unsupported DoRA + bnb quantized layer combinations.
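
The first change above can be pictured as a small routing decision. The sketch below is plain Python for illustration only, not the code in this PR: `is_single_vector` and `choose_nf4_path` are hypothetical names, the path labels are made up, and the single-vector condition (element count equal to the last dimension) is an assumption about what "that shape" means.

```python
from math import prod


def is_single_vector(input_shape):
    """True when the input holds exactly one vector, e.g. (1, 768) or (768,)."""
    return len(input_shape) > 0 and prod(input_shape) == input_shape[-1]


def choose_nf4_path(input_shape, weight_stored_on_cpu):
    """Hypothetical helper: name the code path for one NF4 linear forward.

    Assumption: only CPU-stored weights that are autocast to the compute
    device need to avoid the single-vector gemv_4bit kernel.
    """
    if is_single_vector(input_shape) and weight_stored_on_cpu:
        return "dequantize_then_linear"
    return "bnb_default"


print(choose_nf4_path((1, 768), weight_stored_on_cpu=True))
print(choose_nf4_path((1, 4096, 3072), weight_stored_on_cpu=True))
```

Keeping the decision in one small function like this would make it easy to test the routing separately from the bitsandbytes kernels themselves.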

Related Issues / Discussions

@keturn reported issues on Discord: https://discord.com/channels/1020123559063990373/1049495067846524939/1506160085108396092

QA Instructions

Validate in a CUDA environment:

python -m pytest \
  tests/app/util/test_torch_cuda_allocator.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_8_bit_lt.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_nf4.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_lora_patches_to_partially_loaded_model \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_model_patches_change_device \
  tests/backend/quantization/test_bnb_llm_int8.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py::test_ggml_tensor_to_device \
  tests/backend/model_manager/load/model_cache/cached_model/ \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py \
  tests/backend/patches/test_layer_patcher.py \
  tests/backend/patches/layers/test_lora_layer.py \
  tests/backend/patches/layers/test_set_parameter_layer.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py

Validate in any environment (CUDA devices hidden, so the tests run on CPU):

CUDA_VISIBLE_DEVICES="" python -m pytest \
  tests/app/util/test_torch_cuda_allocator.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_8_bit_lt.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_nf4.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_lora_patches_to_partially_loaded_model \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_model_patches_change_device \
  tests/backend/quantization/test_bnb_llm_int8.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py::test_ggml_tensor_to_device \
  tests/backend/model_manager/load/model_cache/cached_model/ \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py \
  tests/backend/patches/test_layer_patcher.py \
  tests/backend/patches/layers/test_lora_layer.py \
  tests/backend/patches/layers/test_set_parameter_layer.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py

Try various models to make sure that generation isn't impacted.

Merge Plan

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

Pfannkuchensack · 2026-06-02T02:38:44Z

Findings

Low -- invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_invoke_linear_nf4.py:78-84 -- the new if x.numel() == x.shape[-1]: branch is gated on input shape, not on the actual failure condition the comment describes (CPU-stored, device-autocasted Params4bit on certain CUDA/bnb combos). It therefore fires on every single-vector input on all platforms, including GPU-resident weights where gemv_4bit works fine, replacing the fused single-row kernel with dequantize_4bit + F.linear.
- Scenario: partially-loaded (low-VRAM) NF4 FLUX generating at the default batch size 1 -- the canonical NF4 case. t_vec/guidance_vec are (1,), timestep embedding (1,256), pooled CLIP (1,768), and every Modulation/adaLN/MLPEmbedder Linear receives a (1,dim) input where numel == shape[-1]. For FLUX dev that is ~83 Linears per forward pass that now dequantize a dense weight instead of using gemv_4bit. Doubles under CFG (separate batch-1 pos/neg passes).
- Evidence: the condition mirrors bnb's gemv trigger (A.numel()==A.shape[-1] and not requires_grad), so it intercepts exactly the inputs bnb would route to gemv_4bit; quantize_model_nf4(..., modules_to_not_convert=set()) in invokeai/backend/flux/util.py makes all these embedder/modulation Linears InvokeLinearNF4; the partial-load path enables autocasting, so _autocast_forward runs.
- Impact is bounded, not catastrophic: the dominant compute (attention qkv/proj, MLP over the full (1,seq_len,3072) sequence) keeps numel >> shape[-1] and stays on matmul_4bit. Peak VRAM cost is set by the largest single transient (the 3072->18432 modulation weight, ~113 MB bf16, freed each call and reused by the caching allocator), not by the cumulative ~6.5 GB of alloc/free traffic per pass. So this is a modest, real per-step overhead on the small Linears, accepted as a correctness-for-speed trade where gemv was crashing -- but it is applied unconditionally rather than narrowed to the offloaded-weight case that actually fails.
- To expose this issue, add a test that runs CustomInvokeLinearNF4._autocast_forward with a (1, in_features) CUDA input and a GPU-resident weight and asserts the gemv path is taken (e.g. bnb.functional.dequantize_4bit is not called). The corresponding fix is to gate the dequant branch on weight-on-CPU rather than on input shape, so GPU-resident single-vector inference keeps using gemv_4bit.
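
The shape condition and the memory figure in this finding can be checked with a few lines of arithmetic. This is a plain-Python sketch: `hits_single_vector_branch` is a hypothetical name, and `seq_len=4096` is an illustrative assumption rather than a value from the PR.

```python
from math import prod


def hits_single_vector_branch(shape):
    # Mirrors the reviewed condition x.numel() == x.shape[-1].
    return prod(shape) == shape[-1]


# Input shapes quoted in the review; seq_len=4096 is an illustrative assumption.
shapes = {
    "t_vec / guidance_vec": (1,),
    "timestep embedding": (1, 256),
    "pooled CLIP": (1, 768),
    "attention / MLP sequence": (1, 4096, 3072),
}
for name, shape in shapes.items():
    print(f"{name}: {hits_single_vector_branch(shape)}")

# Dense bf16 size of the 3072 -> 18432 modulation weight (2 bytes per element).
weight_bytes = 3072 * 18432 * 2
print(weight_bytes / 1e6)    # about 113.2 (decimal MB)
print(weight_bytes / 2**20)  # 108.0 (MiB)
```

Only the first three shapes satisfy the condition; the full-sequence input does not, which matches the claim that the dominant compute stays on matmul_4bit.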
Low -- tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py:350-355,580-619 -- the new pytest.skip("DoRA patches ... not compatible with bnb quantized layers") masks a path that was erroring, not passing, and replaces it with nothing that asserts the guard.
- Scenario: on CUDA, the parametrized cases test_quantized_linear_sidecar_patches[cuda-single_dora-invoke_linear_nf4] and [...-invoke_linear_8_bit_lt]. single_dora already existed in the shared patch_under_test fixture on origin/main (added in PR 9063) and those tests had no skip guard, so the cases ran on CUDA.
- Evidence: a DoRALayer is neither LoRALayer nor FluxControlLoRALayer, so in invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py it falls to unprocessed_patches; _cast_weight_bias_for_input returns a torch.empty(..., device="meta") weight (custom_invoke_linear_nf4.py:37); DoRALayer.get_parameters (invokeai/backend/patches/layers/dora_layer.py) then raises RuntimeError on the meta device. The test had only a positive allclose, no pytest.raises, so on CUDA it ERRORED. The skip silences that, and no test anywhere now asserts the guard fires -- only the skip strings reference it.
- This is a coverage gap, not a production defect: DoRA genuinely cannot read bnb-quantized weights and the raise is intended. But the unsupported combination should be locked in, not erased.
- To expose this issue, add a test that wraps an NF4 (and an 8-bit) custom layer, attaches a DoRALayer sidecar, runs forward(), and asserts pytest.raises(RuntimeError) -- converting the silent skip into an explicit, documented guard assertion.
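
The shape of such a guard test might look like the sketch below. It uses only stand-in classes so that it runs without torch or bitsandbytes: `FakeMetaWeight`, `FakeDoRAPatch`, and `forward_with_patch` are hypothetical names, and the real test would build actual NF4 and 8-bit custom layers and would likely use pytest.raises instead of unittest.

```python
import unittest


class FakeMetaWeight:
    """Stand-in for a placeholder weight on the meta device (no real data)."""

    device = "meta"


class FakeDoRAPatch:
    """Hypothetical stand-in for a DoRA patch that must read the original weight."""

    def get_parameters(self, weight):
        if weight.device == "meta":
            raise RuntimeError("DoRA needs real weight data; got a meta weight")
        return {}


def forward_with_patch(patch, weight):
    # Stand-in for a quantized layer's forward() applying an unprocessed patch.
    return patch.get_parameters(weight)


class DoRAGuardTest(unittest.TestCase):
    def test_dora_on_quantized_layer_raises(self):
        # The unsupported combination is asserted explicitly instead of skipped.
        with self.assertRaises(RuntimeError):
            forward_with_patch(FakeDoRAPatch(), FakeMetaWeight())
```

The point of the pattern is that the test fails if the guard ever stops raising, which a skip cannot detect.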

lstein · 2026-06-17T18:21:01Z

@Pfannkuchensack It looks like this is ready for a re-review.

JPPhoto · 2026-06-23T03:15:56Z

@Pfannkuchensack I think I've addressed your concerns. Let me know how things look.

JPPhoto requested review from Pfannkuchensack, blessedcoolant, dunkeroni and lstein as code owners May 19, 2026 18:17

github-actions Bot added the python, backend, and python-tests labels May 19, 2026

JPPhoto force-pushed the fix-python-cuda-tests branch from 60ec322 to 1b04429 Compare May 22, 2026 03:37

lstein added the 6.13.5 Library Updates label May 25, 2026

lstein added this to Invoke - Community Roadmap May 25, 2026

lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap May 25, 2026

lstein assigned Pfannkuchensack May 25, 2026

JPPhoto force-pushed the fix-python-cuda-tests branch 4 times, most recently from 0d374ff to b40cdb3 Compare May 29, 2026 11:02

JPPhoto force-pushed the fix-python-cuda-tests branch 2 times, most recently from 2005f92 to 09cea23 Compare June 12, 2026 05:37

JPPhoto force-pushed the fix-python-cuda-tests branch from 09cea23 to 5bd37d9 Compare June 22, 2026 00:11

JPPhoto and others added 4 commits June 21, 2026 19:11

Fix python tests on CUDA

331e714

Updated more CUDA tests

5bd37d9

Merge branch 'main' into fix-python-cuda-tests

eb8230a

Address CUDA test review feedback

6b7a55b

Merge branch 'main' into fix-python-cuda-tests

09e75ca

JPPhoto wants to merge 5 commits into invoke-ai:main from JPPhoto:fix-python-cuda-tests


3 participants
