
fix(qwen): estimate VAE working memory so the cache frees room before decode/encode #9305

Open
lstein wants to merge 1 commit into main from fix-qwen-vae-working-memory

lstein commented Jun 26, 2026

Collaborator

Summary

The Qwen Image qwen_image_l2i (decode) and qwen_image_i2l (encode) invocations called model_on_device() without a working_mem_bytes estimate, unlike the SD/SDXL l2i path, which calls estimate_vae_working_memory_sd15_sdxl(...). As a result, the model cache reserved only the default device_working_mem_gb and never evicted the resident transformer / text encoder before the VAE ran.

On a near-full card this OOMs. Reproduced with Qwen Image Edit 2511 (Q8_0) + the standard Qwen Image VAE on a 48 GB AMD W7900: with the transformer (~20.7 GB) and text encoder (~15.8 GB) resident, the autoencoder decode tried to allocate ~5 GiB into the fragmented ~8 GiB remainder and failed:

CUDA out of memory. Tried to allocate 5.01 GiB. GPU 0 has a total capacity of 44.98 GiB
of which 3.69 GiB is free. ... 2.48 GiB is reserved by PyTorch but unallocated.

Root cause

ModelCache._load_locked_model() computes vram_available = free_vram − working_mem and only evicts other models when that drops below what the locked model needs. The VAE is tiny (~242 MB) and already resident, so model_vram_needed ≈ 0 and nothing is ever evicted — the big transformer/text encoder stay put and the decode is squeezed into whatever fragmented VRAM is left.

Passing a realistic working_mem_bytes lets the cache make room (evicting other models) before the operation runs, which is exactly what the SD/SDXL path already does.
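The eviction decision described above can be sketched as a small standalone model. This is a simplified illustration, not the actual ModelCache._load_locked_model() code: the function name needs_eviction and the 3 GiB stand-in for the default device_working_mem_gb reservation are assumptions made for the example.

```python
GIB = 1024**3


def needs_eviction(free_vram: int, working_mem: int, model_vram_needed: int) -> bool:
    """Simplified model of the check: evict other models only when the VRAM
    left after reserving working memory cannot hold the locked model."""
    vram_available = free_vram - working_mem
    return vram_available < model_vram_needed


free_vram = 8 * GIB      # roughly what is left next to transformer + text encoder
vae_vram_needed = 0      # the VAE is already resident, so nothing more to load

# Assumed stand-in for the default working-memory reservation: no eviction.
print(needs_eviction(free_vram, 3 * GIB, vae_vram_needed))           # False

# A realistic decode estimate (~10.6 GiB) exceeds free VRAM: eviction happens.
print(needs_eviction(free_vram, int(10.6 * GIB), vae_vram_needed))   # True
```

With a small reservation the check passes trivially because the VAE needs no additional VRAM, which is why the large models stayed resident.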

Fix

  • Add estimate_vae_working_memory_qwen_image() in vae_working_memory.py.
  • Pass the estimate into model_on_device(working_mem_bytes=...) in both the decode and encode invocations.

Calibration

The estimate is calibrated against a measured decode on a W7900. At 1248×832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082); rounded up to 5500 for headroom. The current SD constant (2200) under-modeled this heavier video-style VAE by ~2.4×.
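As a rough check of these numbers, the sketch below assumes the estimate has the SD-style form height × width × element_size × constant with a 2-byte (16-bit) element size. That formula and the helper name estimate_working_memory are assumptions for illustration; the real estimate_vae_working_memory_qwen_image() may differ in signature and detail.

```python
GIB = 1024**3

# Scaling constants quoted in this PR description.
QWEN_DECODE_CONSTANT = 5500
QWEN_ENCODE_CONSTANT = 2750


def estimate_working_memory(height: int, width: int, element_size: int, constant: int) -> int:
    """Hypothetical SD-style estimate: output pixels * bytes per element * constant."""
    return height * width * element_size * constant


# 1248x832 image, assumed 2-byte elements (fp16/bf16).
decode_bytes = estimate_working_memory(832, 1248, 2, QWEN_DECODE_CONSTANT)
encode_bytes = estimate_working_memory(832, 1248, 2, QWEN_ENCODE_CONSTANT)

print(f"decode estimate: {decode_bytes / GIB:.2f} GiB")  # decode estimate: 10.64 GiB
print(f"encode estimate: {encode_bytes / GIB:.2f} GiB")  # encode estimate: 5.32 GiB
```

Under these assumptions the decode estimate lands just above the measured ~10.06 GiB growth in reserved memory, which is the headroom the rounded-up constant is meant to provide.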

The constant intentionally tracks peak reserved (not just allocated) memory. The cache's guarantee is "if it doesn't evict, then free ≥ estimate," so the estimate must be ≥ the decode's true reserved footprint. This closes the danger zone where the cache would skip eviction yet the decode would still reserve more than the free VRAM:

  • Tight card (transformer + text encoder resident, ~8 GiB free): estimate ~10.6 GiB > free → cache evicts the text encoder → ~24 GiB clean headroom → decode succeeds.
  • Roomy card (free ≥ estimate): no eviction, but free already ≥ the decode's need → fits.
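The two cases above, and the danger zone they close, can be illustrated with a toy model. It assumes the locked VAE needs no extra VRAM, a 2-byte element size, and an SD-style estimate formula; decode_outcome and the variable names are hypothetical, and applying the SD constant to this VAE is shown only to illustrate what an under-sized estimate does.

```python
GIB = 1024**3


def decode_outcome(free_vram: int, estimate: int, true_footprint: int) -> str:
    """Toy model: the cache evicts when free VRAM is below the estimate;
    otherwise the decode runs in whatever is free."""
    if free_vram < estimate:
        return "evict"      # cache frees room before the decode
    if free_vram >= true_footprint:
        return "fits"       # no eviction needed, decode has enough room
    return "oom-risk"       # eviction skipped, but decode needs more than is free


pixels = 1248 * 832
footprint = int(10.06 * GIB)          # measured growth in reserved memory
undersized = pixels * 2 * 2200        # SD constant applied to this VAE
calibrated = pixels * 2 * 5500        # constant from this PR

print(decode_outcome(8 * GIB, undersized, footprint))    # oom-risk
print(decode_outcome(8 * GIB, calibrated, footprint))    # evict
print(decode_outcome(24 * GIB, calibrated, footprint))   # fits
```

As long as the estimate is at least the true reserved footprint, the "oom-risk" branch is unreachable in this model, which is the guarantee the calibration is aiming for.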

Testing

  • Reproduced the OOM, then confirmed a clean run at 1248×832 with the calibrated constant and device_working_mem_gb back at its default — the text encoder is offloaded just before the VAE decode and the generation completes.
  • ruff check / ruff format / compile all clean.

Notes / open question

  • The encode constant (2750) follows the SD-style "half of decode" convention and is not independently measured — a conservative default. Worth a follow-up measurement if encode-side OOMs surface (relevant for Qwen Image Edit, which encodes an input image).
  • Calibration was done on ROCm/W7900; expandable_segments:True did not resolve the fragmentation on that stack — eviction is what reliably works.

🤖 Generated with Claude Code

… decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a
`working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache
therefore only reserved the default `device_working_mem_gb` and never
evicted the resident transformer/text encoder before the VAE decode. On a
near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder
resident) the decode then OOMs trying to allocate its working set into the
fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the
decode and encode paths so the cache makes room (evicting other models when
needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at
1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied
constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved*
(not just allocated) memory so that whenever the cache declines to free room
(free >= estimate) the decode is still guaranteed to fit. Encode uses ~half,
matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions bot added the python (PRs that change python files), invocations (PRs that change invocations), and backend (PRs that change backend files) labels Jun 26, 2026
@lstein lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 26, 2026
@lstein lstein added the 6.13.5 Library Updates label Jun 26, 2026
