fix(qwen): estimate VAE working memory so the cache frees room before decode/encode #9305
Open
lstein wants to merge 1 commit into
… decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache therefore only reserved the default `device_working_mem_gb` and never evicted the resident transformer/text encoder before the VAE decode. On a near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder resident) the decode then OOMs trying to allocate its working set into the fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the decode and encode paths so the cache makes room (evicting other models when needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at 1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved* (not just allocated) memory so that whenever the cache declines to free room (free >= estimate) the decode is still guaranteed to fit. Encode uses ~half, matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
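For readers who want a feel for the size of the estimate in the commit message above, here is a minimal sketch. It is not the code from this PR: the function name `sketch_qwen_vae_working_memory`, the `element_size` default of 2 bytes, and the formula shape (pixels × element size × constant) are all assumptions made for illustration. Only the decode constant `5500` and the "encode uses ~half" rule come from the commit message.

```python
# Hypothetical sketch, NOT the PR's implementation.
# Assumption: working memory scales as height * width * element_size * constant.

DECODE_CONSTANT = 5500                  # from the commit message (rounded up from ~5082)
ENCODE_CONSTANT = DECODE_CONSTANT // 2  # "encode uses ~half", not independently measured


def sketch_qwen_vae_working_memory(
    height: int, width: int, operation: str, element_size: int = 2
) -> int:
    """Return an estimated working-memory size in bytes for a VAE decode or encode.

    element_size is the assumed number of bytes per tensor element
    (2 for fp16/bf16, 4 for fp32).
    """
    if operation == "decode":
        constant = DECODE_CONSTANT
    elif operation == "encode":
        constant = ENCODE_CONSTANT
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return height * width * element_size * constant


if __name__ == "__main__":
    decode_bytes = sketch_qwen_vae_working_memory(1248, 832, "decode")
    encode_bytes = sketch_qwen_vae_working_memory(1248, 832, "encode")
    print(f"decode estimate: {decode_bytes / 2**30:.2f} GiB")
    print(f"encode estimate: {encode_bytes / 2**30:.2f} GiB")
```

Under these assumptions the decode estimate deliberately lands above the measured ~10.06 GiB growth, which is the point of rounding the constant up.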
Summary
The Qwen Image `qwen_image_l2i` (decode) and `qwen_image_i2l` (encode) invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL `l2i` path, which calls `estimate_vae_working_memory_sd15_sdxl(...)`. As a result, the model cache only reserved the default `device_working_mem_gb` and never evicted the resident transformer / text encoder before the VAE decode.

On a near-full card this OOMs. Reproduced with Qwen Image Edit 2511 (Q8_0) + the standard Qwen Image VAE on a 48 GB AMD W7900: with the transformer (~20.7 GB) and text encoder (~15.8 GB) resident, the autoencoder decode tried to allocate ~5 GiB into the fragmented ~8 GiB remainder and failed.

Root cause

`ModelCache._load_locked_model()` computes `vram_available = free_vram − working_mem` and only evicts other models when that drops below what the locked model needs. The VAE is tiny (~242 MB) and already resident, so `model_vram_needed ≈ 0` and nothing is ever evicted: the big transformer/text encoder stay put and the decode is squeezed into whatever fragmented VRAM is left.

Passing a realistic `working_mem_bytes` lets the cache make room (evicting other models) before the operation runs, which is exactly what the SD/SDXL path already does.

Fix

- Add `estimate_vae_working_memory_qwen_image()` in `vae_working_memory.py`.
- Pass the estimate to `model_on_device(working_mem_bytes=...)` in both the decode and encode invocations.

Calibration

The estimate is calibrated against a measured decode on a W7900. At 1248×832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082); rounded up to 5500 for headroom. The current SD constant (`2200`) under-modeled this heavier video-style VAE by ~2.4×.

The constant intentionally tracks peak reserved (not just allocated) memory. The cache's guarantee is "if it doesn't evict, then `free ≥ estimate`," so the estimate must be ≥ the decode's true reserved footprint. This closes the danger zone where the cache would skip eviction yet the decode would still reserve more than the free VRAM.

Testing

- With `device_working_mem_gb` back at its default, the text encoder is offloaded just before the VAE decode and the generation completes.
- `ruff check` / `ruff format` / compile all clean.

Notes / open question

- The encode constant (`2750`) follows the SD-style "half of decode" convention and is not independently measured; it is a conservative default. Worth a follow-up measurement if encode-side OOMs surface (relevant for Qwen Image Edit, which encodes an input image).
- `expandable_segments:True` did not resolve the fragmentation on that stack; eviction is what reliably works.

🤖 Generated with Claude Code
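The eviction check described under "Root cause" can be illustrated with a toy model. This is a sketch only, not the actual `ModelCache` code: the helper `bytes_to_evict` and the example figures (8 GiB free, a 3 GiB default working-memory reserve, an 11 GiB decode estimate) are assumptions chosen for illustration. Only the relation `vram_available = free_vram − working_mem` and the "VAE already resident, so it needs ≈ 0" observation come from the description above.

```python
# Toy model of the eviction decision; NOT the real ModelCache implementation.

GIB = 2**30


def bytes_to_evict(free_vram: int, model_vram_needed: int, working_mem: int) -> int:
    """Return how many bytes other models must give up before the operation runs.

    vram_available = free_vram - working_mem, as in the PR description.
    Eviction is only needed when that falls below what the locked model needs.
    """
    vram_available = free_vram - working_mem
    return max(0, model_vram_needed - vram_available)


if __name__ == "__main__":
    free_vram = 8 * GIB   # assumed fragmented remainder on a near-full card
    vae_needed = 0        # VAE is already resident, so it needs ~0 extra VRAM

    # Before the fix: only a default reserve is requested (3 GiB is an assumed value).
    before = bytes_to_evict(free_vram, vae_needed, working_mem=3 * GIB)

    # After the fix: a realistic decode estimate is passed (11 GiB is an assumed value).
    after = bytes_to_evict(free_vram, vae_needed, working_mem=11 * GIB)

    print(f"default reserve -> evict {before / GIB:.1f} GiB")
    print(f"with estimate   -> evict {after / GIB:.1f} GiB")
```

With the default reserve the check asks for no eviction even though the decode will not fit; with a realistic estimate the same check asks the cache to free the shortfall first.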