Fix python tests on CUDA#9215

Open
JPPhoto wants to merge 5 commits into invoke-ai:main from JPPhoto:fix-python-cuda-tests

Conversation

JPPhoto commented May 19, 2026

Collaborator

Summary

Fix CUDA test and runtime compatibility issues in bnb custom linear autocast paths.

  • Avoid the bitsandbytes NF4 single-vector gemv_4bit path for CPU-stored, device-autocasted weights by using the dequantized linear path for that shape.
  • Update CUDA tests to account for current bitsandbytes/PyTorch failure modes, including ValueError from raw 8-bit CPU-weight inference.
  • Avoid unsafe raw NF4 CPU-weight probes in tests and skip unsupported DoRA + bnb quantized layer combinations.
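
One way to read the first bullet is as a routing rule over input shape and weight placement. The sketch below is a plain-Python illustration of that rule, not the code in this PR: `is_single_vector` and `choose_nf4_path` are hypothetical names, devices are modeled as plain strings, and the returned labels stand in for the real kernels.

```python
from math import prod
from typing import Sequence


def is_single_vector(input_shape: Sequence[int]) -> bool:
    """True when the input holds exactly one vector, i.e. numel == shape[-1]."""
    return len(input_shape) > 0 and prod(input_shape) == input_shape[-1]


def choose_nf4_path(
    input_shape: Sequence[int], weight_device: str, compute_device: str
) -> str:
    """Hypothetical routing helper; returns a label, not a real kernel call."""
    # "CPU-stored, device-autocasted": the weight lives on the CPU while the
    # forward pass runs on another device.
    weight_is_offloaded = weight_device == "cpu" and compute_device != "cpu"
    if is_single_vector(input_shape) and weight_is_offloaded:
        # Assumed fallback: dequantize the 4-bit weight, then run a dense linear.
        return "dequantize_then_linear"
    # Otherwise defer to bitsandbytes, which picks its own single-vector path.
    return "bnb_matmul_4bit"
```

Under these assumptions, `choose_nf4_path((1, 256), "cpu", "cuda")` selects the dequantized path, while the same input against a GPU-resident weight stays on the bitsandbytes path.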

Related Issues / Discussions

@keturn reported issues on Discord: https://discord.com/channels/1020123559063990373/1049495067846524939/1506160085108396092

QA Instructions

  1. Validate in a CUDA environment:
python -m pytest \
  tests/app/util/test_torch_cuda_allocator.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_8_bit_lt.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_nf4.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_lora_patches_to_partially_loaded_model \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_model_patches_change_device \
  tests/backend/quantization/test_bnb_llm_int8.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py::test_ggml_tensor_to_device \
  tests/backend/model_manager/load/model_cache/cached_model/ \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py \
  tests/backend/patches/test_layer_patcher.py \
  tests/backend/patches/layers/test_lora_layer.py \
  tests/backend/patches/layers/test_set_parameter_layer.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py
  2. Validate in any environment (CUDA devices hidden):
CUDA_VISIBLE_DEVICES="" python -m pytest \
  tests/app/util/test_torch_cuda_allocator.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_8_bit_lt.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_custom_invoke_linear_nf4.py \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/test_torch_module_autocast.py::test_torch_module_autocast_bnb_llm_int8_linear_layer \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_lora_patches_to_partially_loaded_model \
  tests/backend/patches/test_layer_patcher.py::test_apply_smart_model_patches_change_device \
  tests/backend/quantization/test_bnb_llm_int8.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py::test_ggml_tensor_to_device \
  tests/backend/model_manager/load/model_cache/cached_model/ \
  tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py \
  tests/backend/patches/test_layer_patcher.py \
  tests/backend/patches/layers/test_lora_layer.py \
  tests/backend/patches/layers/test_set_parameter_layer.py \
  tests/backend/quantization/gguf/test_ggml_tensor.py
  3. Try various models to make sure that generation isn't impacted.
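
The two pytest runs in the steps above differ only in whether `CUDA_VISIBLE_DEVICES` is set to an empty string. A minimal sketch of building both invocations from one target list follows; `build_pytest_invocation` is a hypothetical helper, and `TEST_TARGETS` is abbreviated here for illustration rather than copied in full from the QA steps.

```python
import os
import sys
from typing import Dict, List, Optional, Tuple

# Abbreviated for illustration; the complete list is in the QA steps.
TEST_TARGETS: List[str] = [
    "tests/app/util/test_torch_cuda_allocator.py",
    "tests/backend/quantization/test_bnb_llm_int8.py",
]


def build_pytest_invocation(
    targets: List[str],
    hide_cuda: bool,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the (command, environment) pair for one pytest run."""
    env = dict(os.environ if base_env is None else base_env)
    if hide_cuda:
        # An empty value hides every CUDA device from the child process.
        env["CUDA_VISIBLE_DEVICES"] = ""
    cmd = [sys.executable, "-m", "pytest", *targets]
    return cmd, env
```

Each pair can then be passed to `subprocess.run(cmd, env=env)`, once with `hide_cuda=False` on a CUDA machine and once with `hide_cuda=True`.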

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions Bot added the python, backend, and python-tests labels May 19, 2026
@JPPhoto JPPhoto force-pushed the fix-python-cuda-tests branch from 60ec322 to 1b04429 Compare May 22, 2026 03:37
@lstein lstein added the 6.13.5 Library Updates label May 25, 2026
@lstein lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap May 25, 2026
@JPPhoto JPPhoto force-pushed the fix-python-cuda-tests branch 4 times, most recently from 0d374ff to b40cdb3 Compare May 29, 2026 11:02
@Pfannkuchensack

Collaborator

Findings

  • Low -- invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_invoke_linear_nf4.py:78-84 -- the new if x.numel() == x.shape[-1]: branch is gated on input shape, not on the actual failure condition the comment describes (CPU-stored, device-autocasted Params4bit on certain CUDA/bnb combos). It therefore fires on every single-vector input on all platforms, including GPU-resident weights where gemv_4bit works fine, replacing the fused single-row kernel with dequantize_4bit + F.linear.

    • Scenario: partially-loaded (low-VRAM) NF4 FLUX generating at the default batch size 1 -- the canonical NF4 case. t_vec/guidance_vec are (1,), timestep embedding (1,256), pooled CLIP (1,768), and every Modulation/adaLN/MLPEmbedder Linear receives a (1,dim) input where numel == shape[-1]. For FLUX dev that is ~83 Linears per forward pass that now dequantize a dense weight instead of using gemv_4bit. Doubles under CFG (separate batch-1 pos/neg passes).
    • Evidence: the condition mirrors bnb's gemv trigger (A.numel()==A.shape[-1] and not requires_grad), so it intercepts exactly the inputs bnb would route to gemv_4bit; quantize_model_nf4(..., modules_to_not_convert=set()) in invokeai/backend/flux/util.py makes all these embedder/modulation Linears InvokeLinearNF4; the partial-load path enables autocasting, so _autocast_forward runs.
    • Impact is bounded, not catastrophic: the dominant compute (attention qkv/proj, MLP over the full (1,seq_len,3072) sequence) keeps numel >> shape[-1] and stays on matmul_4bit. Peak VRAM cost is the largest single transient (the 3072->18432 modulation weight, ~113 MB bf16, freed each call and reused by the caching allocator), not the ~6.5 GB of alloc/free traffic per pass. So this is a modest, real per-step overhead on the small Linears, accepted as a correctness-for-speed trade where gemv was crashing -- but it is applied unconditionally rather than narrowed to the offloaded-weight case that actually fails.
    • To expose this issue, add a test that runs CustomInvokeLinearNF4._autocast_forward with a (1, in_features) CUDA input and a GPU-resident weight and asserts the gemv path is taken (e.g. bnb.functional.dequantize_4bit is not called) -- i.e. gate the dequant branch on weight-on-CPU rather than on input shape so GPU-resident single-vector inference keeps using gemv_4bit.
  • Low -- tests/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/test_all_custom_modules.py:350-355,580-619 -- the new pytest.skip("DoRA patches ... not compatible with bnb quantized layers") masks a path that was erroring, not passing, and replaces it with nothing that asserts the guard.

    • Scenario: on CUDA, the parametrized cases test_quantized_linear_sidecar_patches[cuda-single_dora-invoke_linear_nf4] and [...-invoke_linear_8_bit_lt]. single_dora already existed in the shared patch_under_test fixture on origin/main (added in PR 9063) and those tests had no skip guard, so the cases ran on CUDA.
    • Evidence: a DoRALayer is neither LoRALayer nor FluxControlLoRALayer, so in invokeai/backend/model_manager/load/model_cache/torch_module_autocast/custom_modules/custom_linear.py it falls to unprocessed_patches; _cast_weight_bias_for_input returns a torch.empty(..., device="meta") weight (custom_invoke_linear_nf4.py:37); DoRALayer.get_parameters (invokeai/backend/patches/layers/dora_layer.py) then raises RuntimeError on the meta device. The test had only a positive allclose, no pytest.raises, so on CUDA it ERRORED. The skip silences that, and no test anywhere now asserts the guard fires -- only the skip strings reference it.
    • This is a coverage gap, not a production defect: DoRA genuinely cannot read bnb-quantized weights and the raise is intended. But the unsupported combination should be locked in, not erased.
    • To expose this issue, add a test that wraps an NF4 (and an 8-bit) custom layer, attaches a DoRALayer sidecar, runs forward(), and asserts pytest.raises(RuntimeError) -- converting the silent skip into an explicit, documented guard assertion.
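
The second finding's suggestion, replacing a skip with an assertion that the guard fires, can be illustrated with stand-ins. This is a sketch of the pattern only: `StandInDoRAPatch` and `StandInQuantizedLinear` are hypothetical classes that imitate the meta-device failure described above, and it uses the standard library's `unittest` in place of `pytest.raises` so it runs without the repository or its dependencies.

```python
import unittest
from typing import List


class StandInDoRAPatch:
    """Hypothetical stand-in for a DoRA sidecar that must read the dense weight."""

    def get_parameters(self, weight_device: str) -> None:
        # Imitates the described failure: a placeholder weight on the "meta"
        # device carries no data, so the patch cannot be computed from it.
        if weight_device == "meta":
            raise RuntimeError("DoRA patch cannot read a meta-device weight")


class StandInQuantizedLinear:
    """Hypothetical stand-in for a bnb-quantized custom linear layer."""

    def __init__(self) -> None:
        self.patches: List[StandInDoRAPatch] = []

    def forward(self) -> str:
        # Quantized layers hand sidecar patches a meta-device placeholder.
        for patch in self.patches:
            patch.get_parameters(weight_device="meta")
        return "ok"


class DoRAGuardTest(unittest.TestCase):
    def test_dora_on_quantized_layer_raises(self) -> None:
        layer = StandInQuantizedLinear()
        layer.patches.append(StandInDoRAPatch())
        # The unsupported combination is asserted, not skipped.
        with self.assertRaises(RuntimeError):
            layer.forward()


def run_guard_test() -> bool:
    """Run the guard test and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DoRAGuardTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

In the repository itself, the equivalent test would presumably build the real NF4 and 8-bit custom layers and wrap `forward()` in `pytest.raises(RuntimeError)`.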

@JPPhoto JPPhoto force-pushed the fix-python-cuda-tests branch 2 times, most recently from 2005f92 to 09cea23 Compare June 12, 2026 05:37

lstein commented Jun 17, 2026

Collaborator

@Pfannkuchensack It looks like this is ready for a re-review.

@JPPhoto JPPhoto force-pushed the fix-python-cuda-tests branch from 09cea23 to 5bd37d9 Compare June 22, 2026 00:11

JPPhoto commented Jun 23, 2026

Collaborator Author

@Pfannkuchensack I think I've addressed your concerns. Let me know how things look.

Labels

6.13.5 Library Updates, backend, python, python-tests

Projects

Status: 6.13.5 LIBRARY UPDATES

3 participants