fix(qwen): estimate VAE working memory so the cache frees room before decode/encode by lstein · Pull Request #9305 · invoke-ai/InvokeAI

lstein · 2026-06-26T03:25:48Z

Summary

The Qwen Image qwen_image_l2i (decode) and qwen_image_i2l (encode) invocations called model_on_device() without a working_mem_bytes estimate — unlike the SD/SDXL l2i path, which calls estimate_vae_working_memory_sd15_sdxl(...). As a result, the model cache only reserved the default device_working_mem_gb and never evicted the resident transformer / text encoder before the VAE decode.

On a near-full card this OOMs. Reproduced with Qwen Image Edit 2511 (Q8_0) + the standard Qwen Image VAE on a 48 GB AMD W7900: with the transformer (~20.7 GB) and text encoder (~15.8 GB) resident, the autoencoder decode tried to allocate ~5 GiB into the fragmented ~8 GiB remainder and failed:

CUDA out of memory. Tried to allocate 5.01 GiB. GPU 0 has a total capacity of 44.98 GiB
of which 3.69 GiB is free. ... 2.48 GiB is reserved by PyTorch but unallocated.

Root cause

ModelCache._load_locked_model() computes vram_available = free_vram − working_mem and only evicts other models when that drops below what the locked model needs. The VAE is tiny (~242 MB) and already resident, so model_vram_needed ≈ 0 and nothing is ever evicted — the big transformer/text encoder stay put and the decode is squeezed into whatever fragmented VRAM is left.

Passing a realistic working_mem_bytes lets the cache make room (evicting other models) before the operation runs, which is exactly what the SD/SDXL path already does.
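The eviction decision described above can be sketched as a toy function. This is only an illustration of the arithmetic, not the actual ModelCache code: the function name bytes_to_evict, the 3 GiB default reservation, and the 8 GiB free figure are assumptions chosen to mirror the scenario in this PR.

```python
GIB = 1024**3


def bytes_to_evict(free_vram: int, working_mem: int, model_vram_needed: int) -> int:
    """Hypothetical helper: bytes of other models that must be evicted (0 if none)."""
    # Mirrors the description: vram_available = free_vram - working_mem.
    vram_available = free_vram - working_mem
    # Eviction is only needed when the available VRAM falls short of the model's need.
    return max(0, model_vram_needed - vram_available)


# VAE already resident, so it needs ~0 additional VRAM.
# Without an estimate, only a small default reservation applies (3 GiB assumed here).
no_estimate = bytes_to_evict(free_vram=8 * GIB, working_mem=3 * GIB, model_vram_needed=0)

# With a realistic working-memory estimate (~10.6 GiB), the cache must free room.
with_estimate = bytes_to_evict(
    free_vram=8 * GIB, working_mem=int(10.6 * GIB), model_vram_needed=0
)

print(no_estimate)                      # 0: nothing is evicted
print(round(with_estimate / GIB, 1))    # about 2.6 GiB must be freed
```

Under these assumptions the small default reservation never triggers eviction, while the larger estimate forces the cache to free roughly the shortfall before the decode runs.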

Fix

Add estimate_vae_working_memory_qwen_image() in vae_working_memory.py.
Pass the estimate into model_on_device(working_mem_bytes=...) in both the decode and encode invocations.

Calibration

The estimate is calibrated against a measured decode on a W7900. At 1248×832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082); rounded up to 5500 for headroom. The current SD constant (2200) under-modeled this heavier video-style VAE by ~2.4×.
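As a rough check of the numbers, the sketch below assumes the estimator has the same shape as the SD-style ones: output pixels × element size × constant. The function name estimate_working_memory and the 2-byte element size (16-bit weights) are assumptions, not taken from the PR. Under those assumptions the 5500 constant reproduces the ~10.6 GiB estimate quoted for the tight-card case.

```python
GIB = 1024**3


def estimate_working_memory(height_px: int, width_px: int, element_size: int, constant: int) -> int:
    """Hypothetical estimator: output pixels * bytes per element * scaling constant."""
    return height_px * width_px * element_size * constant


# Decode at 1248x832 with the calibrated constant, assuming 2-byte (16-bit) elements.
decode_estimate = estimate_working_memory(832, 1248, element_size=2, constant=5500)

# Encode uses half the decode constant (2750), per the "half of decode" convention.
encode_estimate = estimate_working_memory(832, 1248, element_size=2, constant=2750)

print(f"decode: {decode_estimate / GIB:.2f} GiB")  # decode: 10.64 GiB
print(f"encode: {encode_estimate / GIB:.2f} GiB")  # encode: 5.32 GiB
```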

The constant intentionally tracks peak reserved (not just allocated) memory. The cache's guarantee is "if it doesn't evict, then free ≥ estimate," so the estimate must be ≥ the decode's true reserved footprint. This closes the danger zone where the cache would skip eviction yet the decode would still reserve more than the free VRAM:

Tight card (transformer + text encoder resident, ~8 GiB free): estimate ~10.6 GiB > free → cache evicts the text encoder → ~24 GiB clean headroom → decode succeeds.
Roomy card (free ≥ estimate): no eviction, but free already ≥ the decode's need → fits.
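The two cases above, plus the danger zone the calibration is meant to close, can be walked through with a toy model. This is a simplified sketch: decode_outcome is a hypothetical helper, the 12 GiB "roomy" figure and the 4 GiB under-estimate are made-up illustration values, and the cache is modeled as evicting only when free VRAM is below the estimate.

```python
def decode_outcome(free_gib: float, estimate_gib: float,
                   true_reserved_gib: float, free_after_eviction_gib: float) -> str:
    """Toy model: the cache evicts only when free VRAM is below the estimate."""
    if free_gib < estimate_gib:
        # The cache makes room first, e.g. by offloading the text encoder.
        free_gib = free_after_eviction_gib
    return "fits" if free_gib >= true_reserved_gib else "OOM"


TRUE_RESERVED = 10.06  # measured reserved-memory growth at 1248x832 (from this PR)

# Tight card: ~8 GiB free, estimate ~10.6 GiB, ~24 GiB free after eviction.
tight = decode_outcome(8.0, 10.6, TRUE_RESERVED, 24.0)

# Roomy card: free already exceeds the estimate, so no eviction is needed.
roomy = decode_outcome(12.0, 10.6, TRUE_RESERVED, 24.0)

# Danger zone: a hypothetical estimate below the true footprint skips eviction.
danger = decode_outcome(8.0, 4.0, TRUE_RESERVED, 24.0)

print(tight, roomy, danger)  # fits fits OOM
```

The third case is why the constant tracks peak reserved memory: any estimate below the true footprint leaves a range of free-VRAM values where the cache does nothing and the decode still fails.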

Testing

Reproduced the OOM, then confirmed a clean run at 1248×832 with the calibrated constant and device_working_mem_gb back at its default — the text encoder is offloaded just before the VAE decode and the generation completes.
ruff check / ruff format / compile all clean.

Notes / open question

The encode constant (2750) follows the SD-style "half of decode" convention and is not independently measured — a conservative default. Worth a follow-up measurement if encode-side OOMs surface (relevant for Qwen Image Edit, which encodes an input image).
Calibration was done on ROCm/W7900; expandable_segments:True did not resolve the fragmentation on that stack — eviction is what reliably works.

🤖 Generated with Claude Code

… decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache therefore only reserved the default `device_working_mem_gb` and never evicted the resident transformer/text encoder before the VAE decode. On a near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder resident) the decode then OOMs trying to allocate its working set into the fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the decode and encode paths so the cache makes room (evicting other models when needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at 1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved* (not just allocated) memory so that whenever the cache declines to free room (free >= estimate) the decode is still guaranteed to fit. Encode uses ~half, matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lstein requested review from JPPhoto, Pfannkuchensack, blessedcoolant and dunkeroni as code owners June 26, 2026 03:25

github-actions bot added the python, invocations, and backend labels Jun 26, 2026

lstein assigned Pfannkuchensack Jun 26, 2026

lstein added this to Invoke - Community Roadmap Jun 26, 2026

lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 26, 2026

lstein added the 6.13.5 Library Updates label Jun 26, 2026

lstein wants to merge 1 commit into main from fix-qwen-vae-working-memory
