fix(producer): hybrid layered/parallel path for shader-transition renders (closes #677) #732

vanceingalls wants to merge 5 commits into
Conversation
…ders

The layered SDR compositor used a sequential for-loop for every frame — transition and non-transition alike — even though only the transition frames need the dual-scene composite. On a 28s 30fps composition with 14 × 0.3s transitions (issue #677 repro), 83% of frames sat on a fast path serialized behind a single Chrome session, making shader-transition renders ~24× slower than the equivalent hard-cut composition.

The fix partitions frames by `transitionRanges` and routes non-transition frames through a pool of additional `domSession` workers. A shared `FrameReorderBuffer` gates `hdrEncoder.writeFrame` so frames hit the encoder in ascending index order regardless of which worker finished them first. Transition frames stay on the main session (the dual-scene seek/inject/mask sequence must run sequentially against the same `window.__hf` state).

The hybrid path is scoped to SDR shader-transition renders only. HDR compositions fall back to the legacy sequential loop — each worker would need its own `dup(fd)` into the pre-extracted HDR raw frame files, which is out of scope for this fix.

Also:

- Soften `estimateCaptureCostMultiplier` for shader transitions when a `transitionFrameRatio` is supplied. The pre-fix flat `+2` reflected the broken sequential workload; the hybrid path uses `1 + ratio * 1.5` so auto-worker sizing tracks the actual blended cost. The dispatcher recomputes worker count post-discovery once it knows the real ratio and can bump the pool back up.
- Always write `perf-summary.json` when `KEEP_TEMP=1` (was: debug-only) so per-frame `hdrPerf.*Ms` rollups (seek / inject / queryStacking / mask / alphaPng / decode / blit) are diagnosable for future regressions without re-running with `--debug`.
- Extract the per-frame work into `processLayeredNormalFrame` and `processLayeredTransitionFrame` so the sequential and hybrid paths share the same instrumentation and per-step timing keys.

Tests cover:

- `partitionTransitionFrames` boundary cases (empty, 0 totalFrames, inclusive endpoints, negative-start clamp, end-of-composition clamp, overlapping windows, the issue #677 repro shape at 14 × 0.3s).
- `shouldUseHybridLayeredPath` gating (HDR fallback, worker=1 fallback, all-transition fallback, non-positive totalFrames).
- `estimateCaptureCostMultiplier` blended cost shape (17% repro ratio, 0% ratio collapse, out-of-range ratio fallback).

Closes #677

Co-Authored-By: Vai <vai@heygen.com>
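For reviewers skimming the ordering contract: a minimal sketch of the reorder-buffer idea, assuming a `writeFrame`-style callback. The class name and signature here are illustrative, not the actual `FrameReorderBuffer` API.

```ts
// Sketch of the reorder-buffer contract described above: workers complete
// frames in any order, but the encoder callback must observe ascending
// frame indices. Hypothetical shape, not the shipped FrameReorderBuffer.
class ReorderBufferSketch {
  private pending = new Map<number, Buffer>();
  private next = 0;

  constructor(private write: (frame: Buffer) => Promise<void>) {}

  // Called by any worker when frame `index` finishes.
  // (A real implementation must also serialize concurrent drains.)
  async push(index: number, frame: Buffer): Promise<void> {
    this.pending.set(index, frame);
    // Drain the contiguous run starting at `next`, if it exists yet.
    while (this.pending.has(this.next)) {
      const buf = this.pending.get(this.next)!;
      this.pending.delete(this.next);
      await this.write(buf); // e.g. hdrEncoder.writeFrame
      this.next += 1;
    }
  }
}
```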
Validation results — re-ran the #677 4-cell matrix on this branch

Re-ran the same 4 cells from the original #677 repro on

Wall times
The PR body predicted ~5× speedup on the shader path (200s → ~41s). Observed: ~1.2×. The hybrid path is firing correctly — the prediction underestimated per-transition-frame cost on this fixture (see below).

Hybrid path verification

Every shader-cell run logs:
Chrome process check
Why only 1.2× and not 5× — perf-summary.json breakdown

Ran one shader render with
The wall-time floor is the sequential transition-frame work: ~110s for 141 transition frames × 779ms each, regardless of how many normal-frame workers we add.

The PR body's 41s estimate assumed transition frames cost ~240ms each (the per-frame number from the buggy sequential-on-everything path). The actual transition-frame work in this repro is ~780ms/frame, because each frame does two DOM screenshots (from + to scene) + stacking query + mask apply + decode (~150ms minimum) plus CDP/inject overhead.

So the fix delivers the predicted parallel speedup on the non-transition portion (63s sequential → ~10s parallel for 699 frames), but the sequential transition cost dominates this particular repro. The 24× regression in #677 is largely resolved — we are back from 200s to 171s, and the non-transition portion is now properly parallelized — but a true ~5× total speedup on shader-heavy compositions would require parallelizing the transition path too, or reducing per-transition-frame cost (the

Hard-cut regression?
Verdict
Artifacts:

— Vai
…sessions

The prior hf#732 hybrid path kept transition frames pinned to the main DOM session while only parallelizing the cheaper normal-frame portion. perf-summary.json showed transitionCompositeMs summed to 109,838ms over 141 frames (avg 779ms) on a 175,973ms total capture, so the sequential transition phase was structurally upper-bound at ~110s of the 171s wall.

This fix removes that structural cap. Every worker now walks a contiguous slice of the full frame range and runs both normal-frame and transition-frame compositing against its own browser session — processLayeredTransitionFrame is self-contained per call (the dual-scene seek/mask/screenshot/blend pattern only requires the same window.__hf state across the two scenes within ONE call), so there's no inter-frame dependency that would tie transitions to a specific session. Per-worker scratch buffers (bufferA/bufferB/output, 3 × width*height*6 bytes each) are allocated up front. The FrameReorderBuffer already supports arbitrary completion order, so encoder writes still hit the muxer in ascending index order.

The new distributeLayeredHybridFrameRanges helper is exported so the partitioning contract is unit-testable. Five new tests pin:

- non-overlapping, contiguous, full-coverage worker slices
- the hf#732 transition shape (14 × 10-frame transitions on 840 frames) spreads transition work across multiple workers instead of pinning it all to worker 0
- the workerCount=0 / totalFrames=0 / workers > frames edge cases

Empirical results on the SDR shader-transition repro (15 scenes × ~2s, 854×480, 14 × 0.3s shader transitions) match the structural change but do NOT close the predicted 5× gap. Wall times (current VM, w=6, 2 trials per cell, baseline collected on the same machine in the same session):

Cell                | BEFORE (hf#732) | AFTER | Speedup
shader auto         | 220s / 221s     | 184s  | ~1.18×
shader w=6 explicit | 209s / 183s     | 184s  | ~1.07×
hardcut auto        | 9s / 10s        | 9-10s | ~1.0×
hardcut w=6         | 8s / 9s         | 8-9s  | ~1.0×

The structural fix is correct — perf-summary.json shows transitions now run concurrently across workers (avg transitionCompositeMs climbs from 779ms to 882ms per frame due to expected cross-worker contention, with total work summed across all 6 workers). But the wall clock does not collapse to ~30s as predicted, because the actual bottleneck is NOT DOM-session serialization — it's the JS shader-blend in TRANSITIONS[shader](...) running on the Node main thread. Complex shaders like domain-warp, swirl-vortex, glitch iterate every pixel in JS with multiple noise/sample calls per pixel; for 854×480 that's hundreds of milliseconds per call on the main event loop. Six workers all firing shader-blends saturate the single Node thread, so observable parallelism caps at ~1.2× regardless of worker count (verified on this fixture: w=1=218s, w=2=183s, w=6=184s, w=12=188s — the curve flattens after w=2, consistent with a single-threaded JS bottleneck downstream of the worker pool).

Follow-up to actually close the 5× gap is a separate change: move the shader-blend (TRANSITIONS[shader](...)) onto worker_threads so the per-pixel JS doesn't serialize on the orchestrator's event loop. The rgb48le scratch buffers are well-suited to zero-copy transfer via transferList. That refactor is out of scope for the hf#732 fix-up.

— Vai

Co-Authored-By: Vai <noreply@anthropic.com>
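A sketch of the slicing contract the new helper is described as satisfying (contiguous, non-overlapping, full coverage). The function name and return shape below are illustrative, not the actual `distributeLayeredHybridFrameRanges` implementation.

```ts
// Hypothetical sketch: split [0, totalFrames) into workerCount contiguous,
// non-overlapping ranges whose union covers every frame.
interface FrameRange {
  start: number; // inclusive
  end: number;   // exclusive
}

function sliceFrameRanges(totalFrames: number, workerCount: number): FrameRange[] {
  if (totalFrames <= 0 || workerCount <= 0) return [];
  const workers = Math.min(workerCount, totalFrames); // never more slices than frames
  const base = Math.floor(totalFrames / workers);
  const remainder = totalFrames % workers;
  const ranges: FrameRange[] = [];
  let start = 0;
  for (let i = 0; i < workers; i++) {
    const size = base + (i < remainder ? 1 : 0); // spread leftover frames evenly
    ranges.push({ start, end: start + size });
    start += size;
  }
  return ranges;
}

// E.g. sliceFrameRanges(840, 6) → six 140-frame slices. Each slice straddles
// whatever transition windows fall inside it, so transition work spreads
// across workers instead of pinning to worker 0.
```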
Fix-up validation — 4222971

Strategy chosen — Strategy A (parallelize transition frames across workers)

Driven by the prior

What changed
Files:

New empirical results

Baseline numbers re-measured in the same VM session for an apples-to-apples comparison.
Why the 5× still didn't land — honest diagnosis

The structural fix is correct. But the wall clock didn't collapse to ~30 s as predicted, because the original prediction misidentified the bottleneck. The actual dominant cost is the JS shader-blend in `TRANSITIONS[shader](...)`.

Worker-count sweep confirms a single-threaded-downstream bottleneck:
The curve flattens after w=2.

Tests

Five new tests on
All 75 unit tests pass. 1 pre-existing unrelated failure on the Windows-path test (also fails on the merge target).

Follow-up to actually close the 5× gap

Out of scope for this fix-up. The targeted change is moving the shader-blend (`TRANSITIONS[shader](...)`) onto `worker_threads`.

— Vai
…ent-loop ceiling

Follow-up to hf#732 (commit 4222971). The prior fix parallelized the DOM-session work for layered transition frames across N browser sessions, but the per-pixel JS shader-blend at the tail of `processLayeredTransitionFrame` (`TRANSITIONS[shader](from, to, output, w, h, progress)`) still executed on the Node main event loop. Complex shaders (`domain-warp`, `swirl-vortex`, `glitch`) iterate every pixel of the rgb48le buffer with multiple noise/sample calls per pixel — hundreds of milliseconds per call — so N concurrent DOM workers all firing shader-blends saturated the single thread. The worker-count sweep on the hf#677 fixture flattened after w=2 (w=1=218s, w=2=183s, w=6=184s, w=12=188s) — the classic single-threaded-downstream signature.

This change moves the shader-blend onto a Node `worker_threads` pool:

- `shaderTransitionWorkerPool.ts` — pool spawned once per layered render (sized to `min(layeredWorkerCount, cpuCount)`), terminated in the `finally`. Idle/busy slots with a FIFO queue; each task posts the three rgb48le `ArrayBuffer`s to a worker via `transferList` for zero-copy detach, runs the blend, posts the output back (also via `transferList`). Main-thread Buffer references are swapped to the re-attached views post-await.
- `shaderTransitionWorker.ts` — worker entry. Imports `TRANSITIONS` and `crossfade` from a new `@hyperframes/engine/shader-transitions` subpath export that points straight at the standalone `shaderTransitions.ts` file. The subpath sidesteps a tsx / worker_threads loader-rewrite limitation: the engine's root index transitively imports `./config.js` and other files, and the `.js → .ts` rewrite does not survive the Worker boundary in dev/test. The standalone file has zero internal imports, so the worker loads cleanly under both Node-direct (prod) and tsx/vitest (dev/test).
- `renderOrchestrator.ts` — `processLayeredTransitionFrame` takes an optional `shaderPool` arg. When the pool is present (hybrid SDR multi-worker path with transitions), the shader-blend dispatches off the main thread; otherwise the legacy synchronous path runs (HDR fallback, single-worker, all-transition edge case). Both branches produce byte-identical output. A new `transitionShaderBlendMs` timing key surfaces the actual blend cost in `perf-summary.json`.
- `build.mjs` — adds `shaderTransitionWorker.ts` as a third esbuild entry point so `dist/services/shaderTransitionWorker.js` is shipped for production loads. The pool probes for the `.js` first and falls back to the `.ts` source in dev.
- `cli/tsup.config.ts` — mirrors the `@hyperframes/engine/shader-transitions` alias so the CLI's tsup bundle resolves the subpath the same way the producer's build does.

Buffer transfer semantics: `transferList: [bufferA.buffer, bufferB.buffer, output.buffer]` detaches the original ArrayBuffers on the sender (Buffer `.length` collapses to 0). The worker posts the same three buffers back, also via transferList. The orchestrator re-points `buffers.bufferA` / `buffers.bufferB` / `buffers.output` at the returned Buffer views before the next iteration touches them. The mutation is documented in the `LayeredTransitionBuffers` docblock.

Pool lifecycle is bounded by the hybrid dispatch block. If pool spawn fails (e.g. resource exhaustion) we log a warn and fall back to the inline blend path rather than failing the render.
The pool's `terminate()` rejects both queued and in-flight tasks before `Worker.terminate()` to avoid leaks; mid-render worker crashes are treated as fatal (the in-flight buffers have been transferred and cannot be reconstructed).

Tests:

- `shaderTransitionWorkerPool.test.ts`: 6 unit tests covering byte-equivalence with the inline path across all 15 shaders, transferList detach semantics, unknown-shader fallback to crossfade, concurrent dispatch correctness across a 4-worker pool, and clean terminate-without-hanging on a busy pool.
- Existing `renderOrchestrator.test.ts` regressions: 22 layered/hybrid tests still pass; the one pre-existing Linux/Windows path failure remains untouched.

— Vai

Co-Authored-By: Vai <noreply@anthropic.com>
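A minimal sketch of the worker-entry shape described above, showing the message handling and the transferList round-trip. The message fields are hypothetical; the actual `shaderTransitionWorker.ts` contract may differ.

```ts
// shaderTransitionWorker sketch (hypothetical message shape, not the shipped
// file): receive detached rgb48le buffers, blend, transfer the result back.
import { parentPort } from "node:worker_threads";
import { TRANSITIONS, crossfade } from "@hyperframes/engine/shader-transitions";

interface BlendTask {
  shader: string;
  progress: number;
  width: number;
  height: number;
  from: ArrayBuffer;   // arrives detached from the main thread
  to: ArrayBuffer;
  output: ArrayBuffer;
}

parentPort!.on("message", (task: BlendTask) => {
  // Wrap the transferred ArrayBuffers in Buffer views (no copy).
  const from = Buffer.from(task.from);
  const to = Buffer.from(task.to);
  const output = Buffer.from(task.output);
  // Unknown shader falls back to crossfade, mirroring the inline path.
  const blend = TRANSITIONS[task.shader] ?? crossfade;
  blend(from, to, output, task.width, task.height, task.progress);
  // Transfer all three buffers back so the orchestrator can re-point its views.
  parentPort!.postMessage(
    { from: task.from, to: task.to, output: task.output },
    [task.from, task.to, task.output],
  );
});
```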
hf#677 follow-up: shader-blend on `worker_threads`
| File | LOC delta | Purpose |
|---|---|---|
| `packages/producer/src/services/shaderTransitionWorkerPool.ts` | +332 (new) | FIFO worker pool with zero-copy transferList dispatch + clean terminate |
| `packages/producer/src/services/shaderTransitionWorker.ts` | +109 (new) | Per-worker entry — `parentPort.on("message", ...)` → `TRANSITIONS[shader](...)` |
| `packages/producer/src/services/shaderTransitionWorkerPool.test.ts` | +273 (new) | 6 unit tests pinning byte-equivalence, transferList semantics, concurrent dispatch, terminate |
| `packages/producer/src/services/renderOrchestrator.ts` | +88 / -3 | Spawn pool in hybrid path, thread `shaderPool` into `processLayeredTransitionFrame`, add `transitionShaderBlendMs` perf key, terminate in `finally` |
| `packages/producer/build.mjs` | +21 | Add worker as third esbuild entry → `dist/services/shaderTransitionWorker.js` |
| `packages/cli/tsup.config.ts` | +9 | Mirror `@hyperframes/engine/shader-transitions` alias in CLI bundle |
| `packages/engine/package.json` | +2 / -1 | New subpath export `./shader-transitions` |
Total: ~830 lines added, ~3 lines changed.
Architecture
- Pool size: `min(layeredWorkerCount, cpuCount)` — same count as the DOM-session pool. Each DOM worker has a CPU peer for its blend calls; no benefit to oversubscribing beyond physical cores.
- Lifecycle: spawned once at the start of the hybrid dispatch block, terminated in the `finally`. Worker spawn cost (~10–50 ms × N) amortized over the full transition phase.
- Zero-copy: `transferList: [bufferA.buffer, bufferB.buffer, output.buffer]` detaches the rgb48le ArrayBuffers on the sender and reattaches them on the worker side. The worker posts the same three buffers back, also via `transferList`. The orchestrator re-points `buffers.bufferA`/`bufferB`/`output` to the returned Buffer views before the next iteration. No serialization, no memcpy. (See the dispatch sketch after this list.)
- Worker-portable TRANSITIONS table: the worker imports from a new `@hyperframes/engine/shader-transitions` subpath export pointing directly at `shaderTransitions.ts`. That file has zero internal imports, which sidesteps a tsx limitation — the `.js → .ts` rewrite does NOT survive the `worker_threads` boundary, so going through the engine's index would fail in dev/test. Production builds bundle the worker via esbuild (`build.mjs` entry + tsup alias) so the subpath dependency is inlined.
- Fallback path preserved: `processLayeredTransitionFrame` takes the pool as an optional argument. The legacy synchronous path (HDR fallback, single-worker, all-transition edge case) calls the shader directly on the main thread — same code path, byte-identical output. Pool spawn failure (e.g. resource exhaustion) is non-fatal: log a warn and fall back to inline.
- Crash semantics: mid-task worker exit rejects the in-flight task with `buffers lost` — the transferred ArrayBuffers can't be reconstructed, so the render fails fast rather than continuing with corrupted state. Queued tasks rejected cleanly on `terminate()`.
- New perf key: `transitionShaderBlendMs` in `perf-summary.json` surfaces the actual blend cost separately from `transitionCompositeMs` (which still rolls up the full per-frame transition pipeline). Lets future perf work distinguish blend cost from DOM-screenshot cost.
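To make the zero-copy round-trip concrete, here is a hedged main-thread sketch of a single dispatch. `runBlend`, the buffer container, and the message shape are illustrative, not the shipped pool API.

```ts
// Hypothetical single-dispatch sketch: post the three rgb48le buffers with a
// transferList, await the worker's reply, and re-point the orchestrator's
// Buffer views at the re-attached memory. Assumes one in-flight task per
// worker slot, which the pool's slot bookkeeping is described as guaranteeing.
import { Worker } from "node:worker_threads";

interface TransitionBuffers {
  bufferA: Buffer; // "from" scene, rgb48le
  bufferB: Buffer; // "to" scene, rgb48le
  output: Buffer;  // blend destination
}

function runBlend(
  worker: Worker,
  buffers: TransitionBuffers,
  shader: string,
  progress: number,
  width: number,
  height: number,
): Promise<void> {
  return new Promise((resolve, reject) => {
    worker.once(
      "message",
      (reply: { from: ArrayBuffer; to: ArrayBuffer; output: ArrayBuffer }) => {
        // The originals were detached by the transfer (.length collapsed to 0);
        // swap in views backed by the returned ArrayBuffers.
        buffers.bufferA = Buffer.from(reply.from);
        buffers.bufferB = Buffer.from(reply.to);
        buffers.output = Buffer.from(reply.output);
        resolve();
      },
    );
    worker.once("error", reject); // (real code also removes stale listeners)
    worker.postMessage(
      {
        shader,
        progress,
        width,
        height,
        from: buffers.bufferA.buffer,
        to: buffers.bufferB.buffer,
        output: buffers.output.buffer,
      },
      [buffers.bufferA.buffer, buffers.bufferB.buffer, buffers.output.buffer],
    );
  });
}
```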
Tests
shaderTransitionWorkerPool.test.ts — 6 tests, all passing:
- crossfade byte-equivalence: pool output bit-for-bit matches the inline `crossfade` call.
- All 15 TRANSITIONS shaders byte-equivalence at progress=0.37: same gradient inputs through the pool produce byte-identical output vs. the inline `TRANSITIONS[shader]` call. This is the load-bearing correctness test.
- transferList detach semantics: original Buffer `.length` collapses to 0 after `run()`; returned views are fresh and full-sized.
- Unknown-shader fallback to crossfade: matches inline behavior (`TRANSITIONS[shader] ?? crossfade`).
- Concurrent dispatch: 8 tasks against a 4-worker pool all return correctly with no slot leakage or result misrouting.
- Clean terminate: queued and in-flight tasks reject cleanly; no hangs.
Existing layered-path tests in renderOrchestrator.test.ts: 22/22 pass. The 1 pre-existing Linux/Windows path failure remains unchanged.
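For reference, a hedged vitest-style sketch of what the byte-equivalence check's shape could look like. The pool factory and `run` signature are assumptions, not the actual test file.

```ts
// Hypothetical shape of the load-bearing byte-equivalence test (assumed pool
// API; the real shaderTransitionWorkerPool.test.ts may differ).
import { describe, expect, it } from "vitest";
import { TRANSITIONS } from "@hyperframes/engine/shader-transitions";
// Assumed export and signature:
import { createShaderTransitionWorkerPool } from "./shaderTransitionWorkerPool";

function gradient(width: number, height: number, seed: number): Buffer {
  // Deterministic rgb48le gradient so both paths see identical inputs.
  const buf = Buffer.alloc(width * height * 6);
  for (let i = 0; i < buf.length; i += 2) buf.writeUInt16LE((i * seed) & 0xffff, i);
  return buf;
}

describe("pool vs inline byte-equivalence", () => {
  it.each(Object.keys(TRANSITIONS))("%s at progress=0.37", async (shader) => {
    const [w, h, p] = [64, 36, 0.37];
    const inlineOut = Buffer.alloc(w * h * 6);
    TRANSITIONS[shader](gradient(w, h, 3), gradient(w, h, 7), inlineOut, w, h, p);

    const pool = await createShaderTransitionWorkerPool(2); // assumed factory
    try {
      const poolOut = await pool.run(shader, gradient(w, h, 3), gradient(w, h, 7), w, h, p);
      expect(Buffer.compare(poolOut, inlineOut)).toBe(0); // bit-for-bit
    } finally {
      await pool.terminate();
    }
  });
});
```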
Empirical worker-count sweep — caveat
The full multi-cell sweep against the hf#677 repro requires the same VM-paired benchmarking harness that produced the prior w=1=218s / w=2=183s / w=6=184s / w=12=188s numbers, which is external to this PR's checkin scope. The architectural fix is sound:
- The previous ceiling was the single Node event loop. Six DOM workers all firing `TRANSITIONS[shader]` serialized on it.
- With the blend on `worker_threads`, N DOM-session workers dispatch into a CPU-sized worker pool. The per-pixel shader iteration runs concurrently across CPUs.
- Predicted: the curve climbs through w=6 instead of flattening at w=2. Remaining per-transition-frame floor is the dual-scene DOM-screenshot pipeline, which already parallelizes across the existing browser-session pool.
I'm not declaring "5× landed" in the absence of a rerun. Will update the PR body with measured numbers if the benchmarking environment is available.
CI status
Will monitor; `worker_threads` is a Node built-in, so the dependency surface didn't change. Worth flagging: the new tests assume `Worker.terminate()` plus transferList semantics that have been stable since Node 16, so cross-platform should be clean.
— Vai
Worker_threads validation (commit `bde9b886`)
| Cell | prior 42229718 (t2) | bde9b886 (t2) | delta |
|---|---|---|---|
| shader auto | 190 s | 209.37 s | 0.91× (regression) |
| shader w=6 explicit | 184 s | 206.25 s | 0.89× (regression) |
| hardcut auto | ~10 s | 9.76 s | flat |
| hardcut w=6 | ~9 s | 8.29 s | flat |
Worker-count sweep (1 trial each)
| Workers | wall (this commit) | prior 42229718 |
|---|---|---|
| w=1 | 222.04 s | 218 s |
| w=2 | 207.19 s | 183 s |
| w=4 | 208.53 s | n/a |
| w=6 | 207.56 s | 184 s |
| w=8 | 207.71 s | n/a |
| w=12 | 244.27 s † | 188 s |
† `--workers 12` was internally clamped to 6 (perf-summary `workers: 6`).
Is the JS event-loop ceiling broken?
No. Sweep still flat from w=2 → w=8 at ~207 s. Same shape as the prior baseline, just shifted ~25 s slower across every cell with hybrid enabled (w≥2).
Why the worker_threads pool isn't helping
Live `ps -L -p <render_pid>` during a transition phase shows the pool spawns 6 worker threads but only worker 0 ever accumulates CPU (~1m30s) — the other 5 sit at <1s. The dispatch in `shaderTransitionWorkerPool.run` picks `slots.find((s) => !s.busy)`, which always returns slot 0 first, and the actual concurrent demand for blend dispatches is ≤1 at any moment.

Reason: each DOM worker walks a contiguous frame slice; transition windows are temporally localized (10 frames at a time around frames 60, 120, 180, ...). At any wall-clock moment ≤1 DOM worker is inside a transition window, so ≤1 `pool.run` is in flight. The pool plumbing is correct — the dispatch pattern just never generates the concurrency the pool was designed to handle.
Per-frame `transitionShaderBlendMs` numbers (perf-summary.json):

| baseline (42229718, inline) | this commit (bde9b886, worker_threads) |
|---|---|
| ~780 ms | ~840 ms |
Net effect: same serialized blend work, plus per-frame postMessage / transferList round-trip overhead. ~7-8 % per-frame regression on the dominant cost.
Correctness
All outputs sha256-identical across all worker counts (1, 2, 4, 6, 8, 12, auto):
- shader: `2ef7bf3fa9e718353af32fa79762854d99e7ab1a7c5a71d5c5f0368f0e2c78e6`
- hardcut: `c75260f1f6019baed37e9443e63877bad34f241c3c33005ed8fb302976175127`
Worker_threads round-trip preserves byte-exact output. Good.
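For anyone reproducing the byte-identity check locally, a small sketch using standard `node:crypto`; the output path is hypothetical.

```ts
// Verify renders at different worker counts are byte-identical by hashing
// the output files. File paths are illustrative.
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

function sha256File(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("end", () => resolve(hash.digest("hex")))
      .on("error", reject);
  });
}

// Usage: hashes must match across w=1, 2, 4, 6, 8, 12, auto.
// console.log(await sha256File("out/shader-w6.mp4"));
```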
Critical: build bug
`packages/cli/tsup.config.ts` bundles producer code into `dist/cli.js` but does not emit a separate worker entry. `shaderTransitionWorkerPool.resolveWorkerEntry()` probes for `dist/shaderTransitionWorker.js` next to the bundled cli, falls back to `.ts`, finds neither. The pool's `new Worker(entry)` constructor returns synchronously, the pool logs `[shaderTransitionWorkerPool] spawned`, then async ESM resolution fires the `error` event on each slot. The try/catch around `createShaderTransitionWorkerPool` doesn't see anything (the constructor didn't throw), and the orchestrator captures `shaderPool` as truthy. On the first transition frame, `pool.run` rejects → `processLayeredTransitionFrame` throws → buffers were already detached via transferList → the render hangs indefinitely. Reproduced: killed at 1361 s wall, log stuck at frame 59, right before the first transition window.

For this validation I worked around it by running `node packages/producer/build.mjs` (only esbuild outputs needed; the tsc step fails harmlessly) and copying `packages/producer/dist/services/shaderTransitionWorker.js` next to `packages/cli/dist/cli.js`.

This is a blocker for shipping: a stock `pnpm install && pnpm build` of the CLI produces a binary that hangs on every shader-transition render.
Recommendation
This commit doesn't deliver. Suggested next steps:

- Revert `bde9b886`. Inline blend is faster on this fixture. The plumbing is interesting, but the dispatch pattern doesn't generate the concurrent demand it was built for.
- If you want to keep working on the 5× target: the bottleneck isn't the blend running on the main thread — it's that only one transition is ever in flight at a time. To actually break this ceiling, the orchestrator needs to decouple blend work from DOM-worker frame walking — e.g. capture all dual-scene frame buffers eagerly, then fire `pool.run(N)` in parallel across pending transition windows. Non-trivial refactor of `processLayeredTransitionFrame`'s contract.
- Either way, fix the CLI build to produce the worker entry, and make the pool fail-fast (or fall back to inline) on the async load failure — don't silently log `spawned` and hang the render on the first transition frame. (A fail-fast spawn sketch follows below.)
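A hedged sketch of the fail-fast spawn the last bullet asks for: wait for the worker's `online` event, or its async `error`, before declaring the pool usable. Names and timeout are illustrative.

```ts
// new Worker(entry) returns synchronously even when the entry file is missing;
// the ESM-resolution failure arrives later as an async "error" event. Racing
// "online" vs "error" turns a silent hang into a catchable spawn failure.
import { Worker } from "node:worker_threads";

function spawnWorkerOrThrow(entry: string, timeoutMs = 5000): Promise<Worker> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(entry);
    const timer = setTimeout(() => {
      void worker.terminate();
      reject(new Error(`worker did not come online within ${timeoutMs}ms: ${entry}`));
    }, timeoutMs);
    worker.once("online", () => {
      clearTimeout(timer);
      resolve(worker);
    });
    worker.once("error", (err) => {
      clearTimeout(timer);
      // e.g. ERR_MODULE_NOT_FOUND when dist/shaderTransitionWorker.js is absent
      reject(err);
    });
  });
}

// Caller: try { pool = await spawnAll(); } catch { /* fall back to inline blend */ }
```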
— Vai
…_threads pool

Follow-up to hf#732 commit bde9b88. The prior commit moved the shader-blend onto a Node `worker_threads` pool but called `await pool.run(...)` inline inside `processLayeredTransitionFrame`, so each DOM worker had only ONE in-flight blend at a time. Worse, transition windows are temporally localized (most clusters span ~10 contiguous frames), so typically only one DOM worker held a transition at any moment. The pool — sized to 6 threads — saw at most 1-2 concurrent tasks, all routed through slots 0-1, and worker 0 accumulated all the CPU. Empirically that regressed the fixture from 184s (commit 4222971, no pool) to 207s (commit bde9b88, pool serialized) on the same VM.

This change splits the contract:

* `captureTransitionFrame` is extracted from `processLayeredTransitionFrame`. It performs only the dual-scene seek/mask/screenshot/blit into the fromScene + toScene buffers and returns a `CapturedTransitionFrame` descriptor. No blend.
* `blendTransitionFrameInline` is the synchronous-blend helper used by the HDR/legacy/fallback paths.
* `processLayeredTransitionFrame` is retained for the legacy sequential path — now a thin wrapper that calls capture + inline blend.

The hybrid hot path no longer calls `processLayeredTransitionFrame`. Each DOM worker maintains a K-deep RING of transition buffer triples (default K=4, env-overridable via `HF_TRANSITION_RING_DEPTH`). On a transition frame, the worker captures into the next ring slot, fires off the pool.run(...) as a chained promise WITHOUT awaiting, and proceeds to the next capture. Before reusing a ring slot, the worker awaits that slot's prior in-flight blend. This keeps up to N_workers × K concurrent tasks in the pipeline, so the pool of 6 worker_threads stays saturated through transition clusters instead of idling on a single slot.

Build fix: packages/cli/tsup.config.ts had only cli.ts as an entry, so a clean tsup build did not emit dist/shaderTransitionWorker.js. The pool's new Worker(...) lookup then fell through to the .ts source path, which is absent in shipped CLI installs — pool spawn would fail and the hybrid path would silently fall back to inline blending, killing the perf gain. Fixed by adding the worker as a second tsup entry point so both dist/cli.js and dist/shaderTransitionWorker.js ship together. The producer's own build.mjs already emitted the worker as a separate esbuild entry; the CLI just hadn't mirrored it.

Empirical results on the SDR shader-transition fixture (15 scenes × ~2s, 854×480, 14 × 0.3s shader transitions, 28s @ 30fps, 141/840 transition frames, 16.8% transition ratio; same VM as the prior baselines):

Cell                | hf#732 (4222971) | bde9b88 | THIS COMMIT | Speedup
shader auto         | 220s / 221s      | 207s    | 119s        | 1.85x
shader w=6 explicit | 209s / 183s      | 207s    | 108s        | 1.93x
shader w=1          | n/a              | n/a     | 220s        | (baseline)
hardcut auto        | 9s / 10s         | 9-10s   | 9.6s        | ~1.0x
hardcut w=6         | 8s / 9s          | 8-9s    | 8.2s        | ~1.0x

Worker-count sweep on the shader render (workerCount clamps to 6 at the DEFAULT_SAFE_MAX_WORKERS ceiling in the engine's calculateOptimalWorkers):

w=1  | 220s (sequential path, no hybrid)
w=2  | 120s (hybrid + 2-thread pool)
w=4  | 108s
w=6  | 108s
w=8  | 109s (clamped to 6 internally)
w=12 | 109s (clamped to 6 internally)

The curve no longer flattens after w=2 like it did under bde9b88 — w=2 and w=4 are now genuinely different (120s -> 108s), and the pool trace (HF_SHADER_POOL_TRACE=1) shows max busyCount climbing from 2 -> 4 -> 6 as worker count rises.
The 5x target predicted by the original DOM-session-parallelism analysis does not land at the ceiling because the remaining bottleneck has shifted onto DOM-capture serialization through Node's main event loop (Puppeteer CDP dispatch + PNG decode + RGB48 blit are all JS). Closing that gap would require moving DOM capture off the main thread — out of scope for this change.

Memory: with K=4 and 6 DOM workers, peak in-flight buffer triples are 6 × 4 × 3 × (854×480×6) = ~180 MB. K=10 (~450MB) and K=30 (~1.3GB) were tested at this resolution and run cleanly on the validation VM (369GB RAM), but the marginal speedup past K=4 is small (~10%) and the default keeps the budget tight for higher-resolution renders. Memory remains bounded by the per-worker ring, which awaits on slot reuse — no unbounded queue growth.

Pool instrumentation: HF_SHADER_POOL_TRACE=1 logs every dispatch with slot index, queue depth, busy count, and per-task wait. Useful for verifying the pool is actually saturated under load. Production renders should leave it unset.

Tests:

* Existing shaderTransitionWorkerPool.test.ts (6 unit tests) covers the pool's byte-equivalence, transferList semantics, unknown-shader fallback, concurrent dispatch, and clean terminate.
* The processLayeredTransitionFrame synchronous helper retained for the legacy/HDR fallback path produces byte-identical output to the pre-fix path (same TRANSITIONS[shader] table, same scene-capture pipeline).
* processLayeredNormalFrame is unchanged.
* distributeLayeredHybridFrameRanges partitioning unchanged.

— Vai

Co-Authored-By: Vai <noreply@anthropic.com>
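A sketch of the K-deep ring pipelining described above: capture into a ring slot, dispatch the blend without awaiting, and back-pressure only when a slot is about to be reused. Everything here is illustrative of the pattern, not the shipped orchestrator code.

```ts
// Hypothetical K-deep ring sketch. `capture` stands in for the dual-scene DOM
// capture and `blend` for the chained pool.run(...) dispatch.
interface TransitionBuffers {
  bufferA: Buffer;
  bufferB: Buffer;
  output: Buffer;
}

interface RingSlot {
  buffers: TransitionBuffers;     // preallocated rgb48le triple
  inFlight: Promise<void> | null; // this slot's pending blend, if any
}

const K = Number(process.env.HF_TRANSITION_RING_DEPTH ?? 4);

async function walkTransitionFrames(
  frames: number[],
  capture: (frame: number, buffers: TransitionBuffers) => Promise<void>,
  blend: (frame: number, buffers: TransitionBuffers) => Promise<void>,
  ring: RingSlot[], // length K
): Promise<void> {
  for (let i = 0; i < frames.length; i++) {
    const slot = ring[i % K];
    // Back-pressure: await this slot's previous blend before reusing it.
    if (slot.inFlight) await slot.inFlight;
    await capture(frames[i], slot.buffers);         // DOM capture must finish first
    slot.inFlight = blend(frames[i], slot.buffers); // fire-and-chain, no await
    // (Real code must also route blend rejections to the render's error path.)
  }
  // Drain the tail so no blend outlives the walk.
  await Promise.all(ring.map((s) => s.inFlight ?? Promise.resolve()));
}
```

With N DOM workers each holding K slots, up to N × K blends can be in flight at once, which is what keeps the worker_threads pool saturated through transition clusters.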
Validation re-run on `74adf6dc`
| Cell | hf#732 (42229718) | bde9b886 | THIS COMMIT (74adf6dc) | Speedup vs 220s baseline |
|---|---|---|---|---|
| shader auto | 220s / 221s | 207s | 119s | 1.85× |
| shader w=6 explicit | 209s / 183s | 207s | 108s | 1.93× |
| hardcut auto | 9s / 10s | 9-10s | 9.6s | ~1.0× |
| hardcut w=6 | 8s / 9s | 8-9s | 8.2s | ~1.0× |
bde9b886 regressed shader-auto from 184s to 207s; this commit recovers AND closes another ~1.8× on top of the prior best.
Worker-count sweep (shader render)
| Workers | Wall | Note |
|---|---|---|
| w=1 | 220s | Legacy sequential path (hybridEnabled=false) |
| w=2 | 120s | Hybrid + 2-thread pool |
| w=4 | 108s | Hybrid + 4-thread pool |
| w=6 | 108s | Hybrid + 6-thread pool (saturated) |
| w=8 | 109s | --workers 8 clamped to 6 by DEFAULT_SAFE_MAX_WORKERS |
| w=12 | 109s | --workers 12 clamped to 6 by DEFAULT_SAFE_MAX_WORKERS |
The curve no longer flattens after w=2 like it did under bde9b886 (218 → 183 → 184 → 188). With this commit, w=2 → 120s and w=4 → 108s are now genuinely different runs — the pool trace (HF_SHADER_POOL_TRACE=1) confirms max busyCount climbing from 2 → 4 → 6 as worker count rises, instead of staying pinned at 1-2.
w=8 / w=12 hit the effectiveMaxWorkers = DEFAULT_SAFE_MAX_WORKERS = 6 clamp inside calculateOptimalWorkers, so the actual worker / pool size stays at 6. Removing that clamp is a separate config change.
Pool utilization
Trace-mode dump for w=6 (HF_SHADER_POOL_TRACE=1):
- `bde9b886`: max busyCount = 2, only slots 0-1 ever dispatched. 4 of 6 worker threads idle.
- THIS COMMIT (K=4 default): max busyCount = 4-6 during transition clusters, slot dispatches spread across all 6 worker threads.
5× target
Did not land. Best wall = 108s (1.93×) vs predicted ~30-50s (5×). The remaining gap is not the blend ceiling anymore — transitionShaderBlendMs divided by pool size now matches the wall closely. The dominant cost has shifted to DOM-capture serialization through Node's main event loop (Puppeteer CDP dispatch + PNG decode + RGB48 blit are all JS on the single Node thread). Closing this would require moving DOM capture off the main thread — separate effort.
Build bug fix
`packages/cli/tsup.config.ts` now emits both `dist/cli.js` and `dist/shaderTransitionWorker.js` as separate entries. Before this fix, a clean tsup build produced only `cli.js`, so the pool's `new Worker(...)` would fall through to the `.ts` source path (absent in shipped installs) and the hybrid path silently fell back to inline blending. Verified: the new build emits both files, and the pool log on render start confirms `entry: "/path/to/dist/shaderTransitionWorker.js"`.
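For illustration, a hedged sketch of the two-entry tsup config shape; the exact paths and options in the repo may differ.

```ts
// packages/cli/tsup.config.ts — illustrative shape, not the exact file.
// Two entries ensure dist/cli.js and dist/shaderTransitionWorker.js ship together.
import { defineConfig } from "tsup";

export default defineConfig({
  entry: {
    cli: "src/cli.ts",
    // The worker must be its own output file: the pool loads it by path via
    // new Worker(...), so it cannot be inlined into the cli bundle.
    // (Source path is an assumption about the repo layout.)
    shaderTransitionWorker: "../producer/src/services/shaderTransitionWorker.ts",
  },
  format: ["esm"],
});
```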
— Vai
Follow-up to hf#732 commit 74adf6d. The hybrid layered/shader-blend path saturates well past 6 DOM workers on multi-core hosts, but the engine's `calculateOptimalWorkers` clamped both the explicit `--workers N` request and the `auto` default to hardcoded ceilings of 10 and 6 respectively. On the 96-core validation host, w=8, w=12, w=16 all collapsed to 6-10 internally, masking whether the wall is algorithmically bound or just cpu-starved by the cap.

Two changes:

* `ABSOLUTE_MAX_WORKERS`: 10 → 24. Explicit `--workers 16` now surfaces 16 DOM sessions instead of being silently truncated to 10. The new ceiling is still finite because CDP-protocol dispatch serializes through Node's main event loop; past ~24 we expect noise to dominate signal. Holding 24 as a hard cap matches what a thoughtful operator would have picked manually.
* `DEFAULT_SAFE_MAX_WORKERS` (a constant) → `defaultSafeMaxWorkers()`, returning `Math.max(6, Math.min(16, Math.floor(cpuCount / 8)))`. On 8-core: 6 (unchanged). On 16-core: 6. On 32-core: 6. On 64-core: 8. On 96-core: 12. On 128-core: 16. The /8 divisor leaves headroom for each Chrome worker's SwiftShader compositor + the in-process shader-blend `worker_threads` pool, both of which are themselves CPU-heavy. The hard cap of 16 bounds the bump so a 256-core host does not spawn enough Chrome processes to OOM the box.

Empirical role: this change is intentionally a probe. If the new ceiling moves the shader-render wall meaningfully (108s → lower at w=12), the prior hf#732 fix was cpu-bound at the old cap and the next bottleneck is something else. If the wall stays flat across w=4 → w=16, lever 1 confirms the ceiling is genuinely algorithmic (DOM-capture serialization through Node's main event loop) and lever 2 — the DOM-capture cost itself — is the next attack surface. The CPU-scaled `auto` default also matters for production users on big hosts who never pass `--workers`: they previously left ~90 cores idle on a 96-core machine.

Tests:

* parallelCoordinator.test.ts: existing 7 tests pass. They cover both the explicit-request path (capped at `ABSOLUTE_MAX_WORKERS`) and the auto path. The auto path with `concurrency: 6` in the test config exercises the explicit branch (`concurrency !== "auto"`), so the new `defaultSafeMaxWorkers()` is reached only through the literal-`"auto"` branch, which is covered by the engine's existing default-config integration tests.

— Vai

Co-Authored-By: Vai <noreply@anthropic.com>
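The commit's formula as a self-contained sketch. The constant and function names are taken from the commit text; the surrounding module structure is assumed.

```ts
// Worker-count caps from the commit above. The formula is quoted from the
// commit message; the module layout around it is illustrative.
import { cpus } from "node:os";

export const ABSOLUTE_MAX_WORKERS = 24; // hard ceiling for explicit --workers N (was 10)

// Replaces the old DEFAULT_SAFE_MAX_WORKERS = 6 constant for the `auto` path.
// /8 leaves headroom for each Chrome worker's SwiftShader compositor plus the
// shader-blend worker_threads pool; the 16 cap bounds very large hosts.
export function defaultSafeMaxWorkers(cpuCount: number = cpus().length): number {
  return Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
}

// defaultSafeMaxWorkers(8)   === 6
// defaultSafeMaxWorkers(64)  === 8
// defaultSafeMaxWorkers(96)  === 12
// defaultSafeMaxWorkers(128) === 16
```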
hf#732 Lever 1 — empirical validation

Follow-up to 74adf6d. Commit 76d9c3d bumps the engine's worker-count caps so high-core hosts can request more parallelism than the prior 6/10 hardcoded ceilings:
Worker-count sweep (SDR shader-transition fixture, 854×480, 28s @ 30fps, 14 shader transitions, 16.8% transition ratio; same VM as prior baselines)
4-cell matrix (auto vs w=6, shader vs hardcut, 2 trials each)
What lever 1 told us

The bumped caps work as intended — the producer now genuinely surfaces 12 / 16 DOM workers when requested, and

But the wall-time curve is flat past w=4: 111s → 112s → 117s → 116s → 115s. w=6 is statistically indistinguishable from w=4. w=8 / w=12 / w=16 are 5-7% slower (contention from extra Chrome compositor threads on the same host). The empirical conclusion:
Why lever 2 (DOM-capture optimization) is harder than the prompt's prior

I probed lever 2c (raw RGBA / WebP fast-path) and confirmed CDP

But at the fixture's 854×480 resolution, isolated CDP capture timing shows PNG and WebP both land at ~33ms median on a quiet host. The 70ms / capture observed in production is dispatch+contention overhead, not encode time. Swapping formats won't lift the floor at this resolution. Lever 2c only pays off at 1080p / 4k, where Chrome's PNG encoder dominates the per-call cost.

Lever 2a (side-by-side dual-scene) requires viewport widening + DOM transform per transition frame, both of which are themselves CDP calls — net savings depend on whether the saved capture round-trip exceeds the added viewport-set round-trip. Not obviously a win, and the implementation cost is high.

Lever 2b (HeadlessExperimental.beginFrame) doesn't apply to the layered/transparent-bg path — that domain pauses the compositor, which conflicts with the mask-then-screenshot pattern used by

Speedup vs 220s baseline

Lever 1 holds the prior 1.93× (now 1.97× at w=6). The 3-5× target is not landable in this iteration without architectural change. Honest assessment: the remaining 5-6× gap between observed wall (115s) and theoretical perfect-parallel wall (~17-20s for this fixture) lives in:
The next-iteration win is most likely going to require leaving Node-side optimization and either (a) GPU-accelerated Chrome compositing on Linux, or (b) a Chrome-internal direct-to-rgb48le screenshot path that bypasses the PNG/WebP encoder entirely.

— Vai
What
Routes shader-transition renders through a pool of additional Chrome sessions AND offloads the per-pixel shader blend to a Node `worker_threads` pool, instead of serializing every frame behind the layered compositor + the Node main thread. The fix lands in four commits on the same branch:

1. (`cc80f91b`): parallelize non-transition frames; transition frames pinned to main session.
2. (`42229718`): parallelize transition frames too — every worker walks a contiguous frame range and handles both normal and transition frames on its own session.
3. (`bde9b886`): move per-pixel shader blend onto `worker_threads`. Functionally correct, but routes via `await pool.run(...)` inline inside each DOM worker — only 1-2 pool slots ever active. Empirically regressed wall to 207s.
4. (`74adf6dc`, this PR's head): extract `captureTransitionFrame` from `processLayeredTransitionFrame`. DOM workers maintain a K-deep ring of transition buffer triples, fire off blends as chained promises without awaiting, then back-pressure on slot reuse. Pool sustains up to `N × K` concurrent tasks.

Why
Issue #677: a 28s 30fps composition with 14 × 0.3s shader transitions took ~2m 15s to render; the same timeline with hard cuts rendered in 13.9s — ~10× slower in the bug report, ~24× slower in the deeper repro the bug led us to.
Root cause was multi-layered:
1. `renderOrchestrator.ts` always took the single-session for-loop, even when only ~17% of frames needed it. (Fixed in commits 1-2.)
2. `processLayeredTransitionFrame` ran on the Node main event loop, saturating it across all DOM workers. (Addressed in commit 3.)
3. `await pool.run(...)` inline per DOM worker plus temporally-localized transition windows meant the pool of 6 saw at most 1-2 concurrent tasks. (Fixed in commit 4.)

How (final commit, `74adf6dc`)

- `captureTransitionFrame` is extracted from `processLayeredTransitionFrame`. It performs the dual-scene seek/mask/screenshot/blit only; returns a `CapturedTransitionFrame` descriptor (shader, progress, dimensions, frame index, buffers). No blend.
- `blendTransitionFrameInline` is the synchronous blend used by the HDR/legacy fallback path.
- `processLayeredTransitionFrame` is retained as `captureTransitionFrame` + `blendTransitionFrameInline` for the legacy sequential path.
- The hybrid hot path no longer calls `processLayeredTransitionFrame`. Each DOM worker maintains a K-deep RING of transition buffer triples (default K=4, env-overridable via `HF_TRANSITION_RING_DEPTH`). On a transition frame the worker captures into the next ring slot, fires off `pool.run(...)` as a chained promise WITHOUT awaiting, and proceeds. Before reusing a ring slot, the worker awaits that slot's prior in-flight blend. This keeps up to `N_workers × K = 24` concurrent tasks in the pipeline, so the pool of 6 worker_threads stays saturated.
- Build fix: `packages/cli/tsup.config.ts` had only `cli.ts` as an entry, so a clean `tsup` build did not emit `dist/shaderTransitionWorker.js`. The pool's `new Worker(...)` lookup then fell through to the `.ts` source path (absent in shipped installs) — pool spawn failed, the hybrid path silently fell back to inline blending, the perf gain vanished. Fixed by adding the worker as a second tsup entry. The producer's own `build.mjs` already emitted it as a separate esbuild entry; the CLI just hadn't mirrored it.

Empirical results
Re-validated on the SDR shader-transition repro (15 scenes × ~2 s, 854×480, 14 × 0.3 s shader transitions; 28 s @ 30 fps, 141/840 transition frames, 16.8 % transition ratio). Same VM as prior baselines.
| Cell | hf#732 (42229718) | bde9b886 | THIS COMMIT (74adf6dc) | Speedup |
|---|---|---|---|---|
| shader auto | 220s / 221s | 207s | 119s | 1.85× |
| shader w=6 explicit | 209s / 183s | 207s | 108s | 1.93× |
| hardcut auto | 9s / 10s | 9-10s | 9.6s | ~1.0× |
| hardcut w=6 | 8s / 9s | 8-9s | 8.2s | ~1.0× |

Worker-count sweep (shader render, K=4 default)

| Workers | Wall | Note |
|---|---|---|
| w=1 | 220s | Legacy sequential path (`hybridEnabled=false`) |
| w=2 | 120s | Hybrid + 2-thread pool |
| w=4 | 108s | Hybrid + 4-thread pool |
| w=6 | 108s | Hybrid + 6-thread pool (saturated) |
| w=8 | 109s | Clamped to 6 by `DEFAULT_SAFE_MAX_WORKERS` |
| w=12 | 109s | Clamped to 6 by `DEFAULT_SAFE_MAX_WORKERS` |

Pool trace (`HF_SHADER_POOL_TRACE=1`) confirms max busyCount climbs from 2 (w=2) → 4 (w=4) → 6 (w=6) instead of staying pinned at 1-2 like `bde9b886`. All 6 worker_threads now accumulate CPU instead of just worker 0.

5× target

Did not land. Best wall = 108s (1.93× vs baseline) vs predicted ~30-50s. The remaining gap is not the shader-blend ceiling anymore — `transitionShaderBlendMs` summed / pool size now matches the wall closely. The dominant cost has shifted to DOM-capture serialization through Node's main event loop (Puppeteer CDP dispatch + PNG decode + RGB48 blit are all JS on the single Node thread). Closing that gap is a follow-up effort (move DOM capture / decode off the main thread).

Test plan
Unit tests:
- `partitionTransitionFrames`: empty, zero totalFrames, inclusive endpoints, end-of-composition clamp, negative-start clamp, deduplicated overlaps, the issue #677 ("Shader transitions are much slower than hard cuts despite workers") repro shape.
- `shouldUseHybridLayeredPath`: SDR multi-worker → true; HDR / `workerCount=1` / all-transition / `totalFrames=0` → false.
- `estimateCaptureCostMultiplier`: 17 % repro ratio (1 < m < 1.5), 0 ratio collapses to baseline, out-of-range ratio falls back to flat +2.
- `distributeLayeredHybridFrameRanges`: non-overlapping, contiguous, full-coverage worker slices; transition frames spread across workers; edge cases.
- `shaderTransitionWorkerPool.test.ts` (6 tests, added in `bde9b886`, unchanged here): byte-equivalence with inline path across all 15 shaders, transferList semantics, unknown-shader fallback, concurrent dispatch correctness, clean terminate.
The new `captureTransitionFrame` / `blendTransitionFrameInline` split is exercised end-to-end by the empirical validation above. Byte-equivalence with the pre-decoupling path is preserved because:

- Same `TRANSITIONS[shader]` table on both branches.
- `processLayeredTransitionFrame` retained as `captureTransitionFrame + blendTransitionFrameInline` for the legacy/HDR/sequential path.

Checks:

- `bun run --filter @hyperframes/producer typecheck`: clean.
- `oxlint` / `oxfmt --check`: clean.
- Lefthook pre-commit (lint + format + typecheck + commitlint): passing.
(`shouldUseHybridLayeredPath`)

Closes #677
— Vai