[DO-NOT-MERGE] Diagnostic: JS-progress watchdog for UnitTests hangs #1681

Draft
bghgary wants to merge 86 commits into BabylonJS:master from bghgary:unittest-watchdog

Conversation

@bghgary
Contributor

bghgary commented Apr 27, 2026

Status: DO-NOT-MERGE — diagnostic only

[Created by Copilot on behalf of @bghgary]

Verdict

Heartbeat-fed watchdog confirms: slow progress on Ubuntu llvmpipe JSC, not a deadlock.

Run 25026066984: 27✅ 1❌. The lone ❌ is Ubuntu_GCC_JSC failing at Playground rendering with a pixel diff (First pixel off at 108872 ... Pixel difference: 17308 pixels) — a pre-existing render flake that aborted the job before UnitTests ran; it is unrelated to the watchdog.

Ubuntu_Clang_JSC (the previously failing job) passed JavaScript.All in 141.5 s with 3 PASSED, 4 SKIPPED (ExternalTexture). No watchdog abort. A local WSL run took 209 s, with [watchdog-heartbeat] pump_frames=... lines visible during the gaps in mocha output.

Why the previous "60 → 600 s default" wasn't enough

The earlier run on this branch hit a 600-s console-output gap in Materials/Empty-ShaderMaterial and dumped a core. Post-mortem showed gtest in wait_for(16ms), all bgfx threads idle, and the JS thread blocked in IncrementPendingFrameScopes with DeviceImpl{m_frameBlocked=true, m_pendingFrameScopes=0, m_rendering=false, m_firstFrameStarted=true} — the same quiescent inter-frame snapshot you would see in either the slow-progress or the deadlock case.

A single core dump can't distinguish the two: the gtest pump loop spends almost all its wall time in wait_for(16ms), so any random-time capture lands there in either case. We needed a tiebreaker — does the render thread actually keep ticking?

Mechanism (this PR)

Two-source watchdog:

  1. Console output activity (mocha lines on stdout). Existing channel.
  2. Render-thread frame progress. g_pumpFrameCount (atomic), incremented after each Start/Finish cycle in the gtest pump loop.

Watchdog thread polls the atomic every 1 s. When it advances, the "last activity" timestamp resets — so a slow-but-progressing test never trips. Every 30 s, the watchdog emits [watchdog-heartbeat] pump_frames=N so we can see progress in the log even during shader-compile silence. If both sources stall for the timeout window, abort with a message that names both.

This means a real deadlock — render thread stuck in bgfx::frame() or FinishRenderingCurrentFrame() — would now trip cleanly, and slow Materials shader compiles do not.
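
A minimal sketch of that dual-source logic, showing its shape (g_pumpFrameCount and the [watchdog-heartbeat] line come from this PR; the timeout plumbing, console-activity hook, and abort path are illustrative assumptions, not the exact Tests.JavaScript.cpp code):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Incremented by the render-thread pump loop after each Start/Finish cycle.
std::atomic<uint64_t> g_pumpFrameCount{0};

void WatchdogThread(std::chrono::seconds timeout)
{
    using clock = std::chrono::steady_clock;
    auto lastActivity = clock::now();   // in the real code, console output also resets this
    auto lastHeartbeat = lastActivity;
    uint64_t lastFrames = g_pumpFrameCount.load();

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::seconds{1});
        const uint64_t frames = g_pumpFrameCount.load();
        const auto now = clock::now();

        if (frames != lastFrames)        // render thread made frame progress
        {
            lastFrames = frames;
            lastActivity = now;          // slow-but-progressing tests never trip
        }

        if (now - lastHeartbeat >= std::chrono::seconds{30})
        {
            std::fprintf(stderr, "[watchdog-heartbeat] pump_frames=%llu\n",
                         static_cast<unsigned long long>(frames));
            lastHeartbeat = now;
        }

        if (now - lastActivity >= timeout)
        {
            std::fprintf(stderr, "Watchdog: no console output and no pump-frame progress "
                                 "within the timeout window; aborting\n");
            std::abort();                // core dump for post-mortem
        }
    }
}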

Default restored to 60 s since false positives are eliminated.

Where it changes

  • Apps/UnitTests/Source/Tests.JavaScript.cpp — atomic + watchdog dual-source logic + pump-loop increment.

Follow-ups (other PRs / repos, not this one)

  • Slow llvmpipe shader-compile chains in Babylon Materials/* are real and consume ~20–30% of the JSC job budget. See #1674 ([DO-NOT-MERGE] Partner integration test: #1652 + #1646) for partner-side discussion. Possible mitigations: skip the slowest cases on llvmpipe, bump the JSC job timeout, or investigate bgfx-side llvmpipe compile cost.

bghgary and others added 30 commits March 24, 2026 22:05
- Add Tests.ExternalTexture.Render.cpp: end-to-end test that renders a
  texture array through a ShaderMaterial to an external render target,
  verifying each slice (red, green, blue) via pixel readback.
- Add tests.externalTexture.render.ts: JS test with sampler2DArray shader.
- Add RenderDoc.h/cpp to UnitTests for optional GPU capture support.
- Add Utils helpers: CreateTestTextureArrayWithData, CreateRenderTargetTexture,
  ReadBackRenderTarget, DestroyRenderTargetTexture (D3D11, Metal, stubs for
  D3D12/OpenGL).
- Fix ExternalTexture_Shared.h: pass m_impl->NumLayers() instead of
  hardcoded 1 in Attach(), preserving texture array metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove layerIndex parameter from Impl::Update declaration to match
the updated signature in ExternalTexture_Shared.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate HeadlessScreenshotApp, StyleTransferApp, and PrecompiledShaderTest
from AddToContextAsync (promise-based) to CreateForJavaScript (synchronous).
This removes the extra frame pump and promise callbacks that were previously
required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llers

- Move RenderDoc.h/cpp to D3D11-only CMake block with HAS_RENDERDOC define
- Guard RenderDoc calls with HAS_RENDERDOC instead of WIN32
- Update StyleTransferApp and PrecompiledShaderTest to use CreateForJavaScript

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On OpenGL, TextureT is unsigned int (not a pointer), so reinterpret_cast
fails. Add NativeHandleToUintPtr helper using if constexpr to handle both
pointer types (D3D11/Metal/D3D12) and integer types (OpenGL).
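
A sketch of what such a helper can look like (TextureT here stands for the per-API native texture handle type; the exact name and signature in the PR may differ):

#include <cstdint>
#include <type_traits>

template <typename TextureT>
uintptr_t NativeHandleToUintPtr(TextureT handle)
{
    if constexpr (std::is_pointer_v<TextureT>)
    {
        // D3D11/Metal/D3D12: the native handle is a pointer.
        return reinterpret_cast<uintptr_t>(handle);
    }
    else
    {
        // OpenGL: the native handle is an unsigned integer (GLuint).
        return static_cast<uintptr_t>(handle);
    }
}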

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RenderDoc.h/cpp now accept void* device instead of ID3D11Device*
- Move RenderDoc to WIN32 block (not D3D11-only) since it works with any API
- Fix OpenGL build: use NativeHandleToUintPtr helper for TextureT cast
- Add Linux support (dlopen librenderdoc.so)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DestroyTestTexture after FinishRenderingCurrentFrame so bgfx::frame()
processes the texture creation command before the native resource is released.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensure the JS startup dispatch completes before calling
deviceUpdate.Finish() and FinishRenderingCurrentFrame(). This
guarantees that bgfx::frame() processes the CreateForJavaScript
texture creation command, making the texture available for subsequent
render frames. The old async API had an implicit sync point
(addToContext.wait) that the new sync API lost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eation

- Rename CreateForJavaScriptWithTextureArray to CreateForJavaScript and
  use arraySize=1 since texture array rendering is covered by
  RenderTextureArray. The old test crashed on CI (STATUS_BREAKPOINT in
  bgfx when creating texture arrays via encoder on WARP).
- Revert two-step create+override approach back to single createTexture2D
  call with _external parameter (overrideInternal from JS thread doesn't
  work since the D3D11 resource isn't created until bgfx::frame).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateForJavaScript already exists in Tests.ExternalTexture.cpp,
causing a linker duplicate symbol error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test)

The D3D11-specific CreateForJavaScript test crashed on CI due to
bgfx assertions when calling createTexture2D with _external on
the encoder thread. The cross-platform CreateForJavaScript test
in Tests.ExternalTexture.cpp already covers this functionality.
The texture array rendering is covered by RenderTextureArray.

Also revert app startup ordering to Finish->Wait (matching the
pattern used by HeadlessScreenshotApp on master).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx callback's fatal() handler was silently calling debugBreak()
on DebugCheck assertions with no output, making CI crashes impossible
to diagnose. Now logs the file, line, error code and message to stderr
before breaking, so the assertion details appear in CI logs.
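
For illustration, the extra logging could be done by a helper along these lines, assuming the current bgfx::CallbackI::fatal parameters (file path, line, Fatal code, message); the actual callback class in this repo is not reproduced here:

#include <bgfx/bgfx.h>
#include <cstdint>
#include <cstdio>

// Called from the callback's fatal() override before it falls through to debugBreak().
void LogBgfxFatal(const char* filePath, uint16_t line, bgfx::Fatal::Enum code, const char* message)
{
    std::fprintf(stderr, "[bgfx fatal] %s(%u): code=%d %s\n",
                 filePath != nullptr ? filePath : "<unknown>",
                 static_cast<unsigned>(line),
                 static_cast<int>(code),
                 message != nullptr ? message : "");
    std::fflush(stderr);   // make sure the details reach CI logs even if the process crashes next
}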

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DISM/d3dconfig step to CI to enable D3D debug layer, which will
provide detailed D3D11 validation messages for the CreateShaderResourceView
E_INVALIDARG failure. Kept the _external createTexture2D path (reverted
the AfterRenderScheduler approach) so we can see the actual D3D debug
output that explains the SRV mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx _external texture path triggers E_INVALIDARG in
CreateShaderResourceView on CI's WARP D3D11 driver. The overrideInternal
alternative doesn't support full array textures (hardcodes ArraySize=1).
Since the _external path works on real GPUs, skip the render test on CI
via BABYLON_NATIVE_SKIP_RENDER_TESTS and keep the direct _external path.

Also adds D3D debug layer enablement to CI for future diagnostics, and
logs bgfx fatal errors to stderr before crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from _external parameter (crashes on WARP) to create+overrideInternal
two-step approach. The overrideInternal path is compatible with WARP but
sets ArraySize=1 in the SRV, so the RenderTextureArray test (which needs
full array access) is skipped on CI. The render test works on real GPUs
where the _external path succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The overrideInternal call fires on AfterRenderScheduler after the first
bgfx::frame(). An additional frame pump ensures the native texture
backing is applied before the scene render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove deleted Tests.ExternalTexture.D3D11.cpp from Install/Test/CMakeLists.txt
- Add extra frame pump after CreateForJavaScript in HeadlessScreenshotApp
  and StyleTransferApp so overrideInternal has time to apply the native
  texture backing before the first render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompletionScope instead of deferred member release.
…cquired FrameCompletionScope in SubmitCommands.
First CI run with the watchdog confirmed it works as designed: it caught
a 60-second console-output gap inside Materials/Empty-ShaderMaterial and
produced a clean core dump.

Post-mortem analysis of the dump on Ubuntu Clang JSC showed:

- Render thread (gtest) was in the normal wait_for(16ms) inter-frame
  window of the pump loop, not stuck in Finish or dispatcher tick.
- All bgfx renderer / worker threads were idle in cond_wait.
- JS thread was blocked in IncrementPendingFrameScopes because the gate
  was momentarily closed (m_frameBlocked=true, m_pendingFrameScopes=0,
  m_rendering=false), which is the expected state between Finish and
  the next Start. The next Start call (~16ms later) would have unblocked
  it.
- Re-running the same Tests.JavaScript binary locally on WSL Ubuntu
  24.04 + Mesa llvmpipe completed all 3 tests (JavaScript_All,
  ShaderCache.*, UniformPadding.*) successfully in 231 s.

Conclusion: the 60s watchdog was firing on slow-but-progressing shader
compile inside Babylon's Materials test suite on software GL, not on a
real deadlock. Bump the default to 600s so it only fires on genuinely
stuck progress (true hangs) while preserving headroom inside the 30-min
job timeout.

The watchdog itself is still valuable as a guard against real hangs and
as a way to convert opaque job-timeout cancellations into actionable
core dumps; only the threshold needed adjusting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary and others added 6 commits April 27, 2026 16:58
Previous watchdog only watched JS console output. On Ubuntu llvmpipe, mocha goes silent for many seconds during a single test that compiles many shader variants synchronously inside bgfx::frame(), producing false-positive aborts at 60s and even 600s.

Add a render-thread frame counter (g_pumpFrameCount) incremented each pump iteration. The watchdog thread samples it every 1s; if it advanced, it resets the activity timestamp and emits a [watchdog-heartbeat] line every 30s for diagnosability.

Effect:
- Slow-but-progressing shader compile: pump_frames keeps advancing, the watchdog never trips, and the test completes (209 s locally on WSL llvmpipe; the slowest CI runners may take roughly 10x longer).
- Real deadlock (render thread or scope-acquire stuck): pump_frames freezes, no console activity either, watchdog trips with an actionable signal.

Restored default threshold from 600s back to 60s — false positives are now eliminated by the heartbeat, so we want fast detection of a genuine stall.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…slowness

Past CI runs on this branch show massive variance in `Materials/Empty
ShaderMaterial`: 4.7 s on a fast Linux runner (run 25026066984), > 600 s on
a slow runner (run 25025043245). Same code, same test. Heartbeat watchdog
proved render-thread frames advance during the silence -- not a deadlock --
but doesn't tell us *what within each frame* eats the wall time on slow
runners.

This commit adds per-phase accumulators to the pump loop:

  - g_pumpStartNs   : time inside StartRenderingCurrentFrame()
  - g_pumpFinishNs  : time inside FinishRenderingCurrentFrame()
  - g_pumpWaitNs    : time blocked in exitCodeFuture.wait_for(16ms)

Heartbeat now emits `start_ms / finish_ms / wait_ms` alongside `pump_frames`
every 30 s so a slow run produces a phase-time breakdown directly in CI logs.

If on a slow runner finish_ms dominates start_ms+wait_ms, the cost is in
bgfx::frame() / submit / GL upload. If start_ms dominates, it's RAF / scene
init. If wait_ms dominates relative to frames, JS-side work is blocking.
That data tells us whether to chase shader compile, scene init, llvmpipe
driver work, or a soft livelock -- without guessing.
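
Roughly what the accumulators look like (a sketch; the timing helper and the device / exitCodeFuture names in the commented usage are assumptions, not the committed code):

#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<uint64_t> g_pumpStartNs{0};
std::atomic<uint64_t> g_pumpFinishNs{0};
std::atomic<uint64_t> g_pumpWaitNs{0};

// Run a callable and return how long it took, in nanoseconds.
template <typename F>
uint64_t TimedNs(F&& f)
{
    const auto begin = std::chrono::steady_clock::now();
    f();
    return static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - begin).count());
}

// Inside the pump loop (device and exitCodeFuture are hypothetical names):
//   g_pumpStartNs  += TimedNs([&] { device.StartRenderingCurrentFrame(); });
//   g_pumpFinishNs += TimedNs([&] { device.FinishRenderingCurrentFrame(); });
//   g_pumpWaitNs   += TimedNs([&] { exitCodeFuture.wait_for(std::chrono::milliseconds{16}); });
// The heartbeat then reports each accumulator's delta (in ms) per 30 s window.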

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
We're characterizing runner-variance-driven Materials slowness in Ubuntu_*_JSC. Every other matrix entry adds Actions runtime / cost without contributing data. Restore from master before merging.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Linux JSC UnitTests has been failing on a non-trivial fraction of GitHub-hosted ubuntu-latest runners. Per-phase pump-loop instrumentation on a slow runner showed:

  pump_frames advancing at the full ~62 Hz cadence
  start_ms  =    16 ms / 30000 ms (0.05% of wall time)
  finish_ms =    36 ms / 30000 ms (0.12% of wall time)
  wait_ms   = 29943 ms / 30000 ms (99.8% of wall time)

i.e., the C++ render thread is essentially idle. The bottleneck is on the JS thread (Babylon.js scene.executeWhenReady polling, gated by JSC executing JS code). With the same code and workflow on different ubuntu-latest runner instances, the same 'Empty ShaderMaterial should compile' test took 4.7 s on a fast runner vs 620.2 s on a slow runner — 133x runner variance. This is a runner-side problem (likely JSC JIT/GC under noisy-neighbor VM conditions), not our code.

30 min was a tight budget for the full suite even on average runners; on slow runners the suite cannot complete in time. Bump to 60 min so legitimate slow runs finish instead of the job timing out.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Slow-runner characterization complete; restore the full CI matrix so we validate the timeout bump against real conditions on all platforms.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Contributor Author

bghgary commented Apr 28, 2026

Update: CI completion root cause proven

Recommendation: bump Linux job timeout 30→60 min (pushed in 89a04b41). Restored full CI matrix in ee1a6662; validation run in flight.

Data

Per-phase pump-loop instrumentation pushed earlier in this branch produced the following on a slow ubuntu-latest runner (cancelled at ~16 min into UnitTests):

[watchdog-heartbeat] pump_frames=+1859 / 30 s   (≈ max framerate at 16 ms cadence)
[watchdog-heartbeat] start_ms   =+16   / 30000  (0.05 % of wall time)
[watchdog-heartbeat] finish_ms  =+36   / 30000  (0.12 % of wall time)
[watchdog-heartbeat] wait_ms    =+29943 / 30000 (99.8 % of wall time)

Per-test result on the same slow runner:

[log]   Materials
[log]     ✅ Empty ShaderMaterial should compile (620201ms)

That same test ran in 4,658 ms on a fast runner the day before — a 133× variance for the same test, same code, same workflow, different ubuntu-latest VM instance.

What this proves

  • The native render thread is essentially idle during slow runs. bgfx, llvmpipe, and the GL driver are not the bottleneck (start+finish < 0.2 % of wall time).
  • The bottleneck is on the JS thread — Babylon.js's scene.executeWhenReady polling, gated by JSC executing JS code.
  • Cause is almost certainly noisy-neighbor on cheap GitHub-hosted VMs (JSC JIT/GC under contention). Not deadlock, not threading, not bgfx, not llvmpipe shader compile, not our code.
  • Earlier "watchdog firings" on the 60s and 600s defaults were the watchdog correctly catching slow runs that would otherwise have hit the 30-min job timeout. The current heartbeat-fed watchdog (frame-counter tiebreaker) does not fire on slow-but-progressing runs — verified this session: 16 min of slow progress, no false-positive abort.

Fix in this PR

  1. build-linux.yml timeout-minutes: 30 → 60. Slow runners need more headroom; 60 min covers worst observed.
  2. Heartbeat-fed watchdog kept as the real-hang safety net (60 s default, frame-counter tiebreaker, opt-out via BN_WATCHDOG_TIMEOUT_S=0).
  3. Per-phase pump instrumentation kept as a permanent diagnostic — it's how we'll diagnose the next CI regression in seconds instead of days.

Pixel-diff failures

Out of scope for this PR — separate issue, deferred per @bghgary.

[Responded by Copilot on behalf of @bghgary]

bghgary and others added 13 commits April 28, 2026 15:01
The 60-min timeout bump from the previous commit was insufficient: Linux JSC jobs still timed out at 60 min on slow runners. Per-phase pump instrumentation showed pump_frames advancing at full rate but the JS thread spending 53+ min in scene.executeWhenReady polling on Materials/Empty ShaderMaterial — vs 4.7 s on a fast runner.

Job logs revealed the runner-side cause:

  libEGL warning: DRI3 error: Could not get DRI3 device

  libEGL warning: Ensure your X server supports DRI3 to get accelerated rendering

Some ubuntu-latest VMs cannot initialize DRI3 and silently fall back to a slow software OpenGL path. Different runners hit different fallbacks (swrast vs llvmpipe), producing 100x+ runtime variance for identical code.

Force LIBGL_ALWAYS_SOFTWARE=1 + GALLIUM_DRIVER=llvmpipe so every runner uses the same vectorized software path. Eliminates DRI3 dependency and runner variance.

Also temporarily trim CI matrix to Linux JSC only while validating; restore before merging.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause of the 4-30x Linux JSC CI runtime variance was proven on the rework-thread-model-perf branch (4 baseline runs hit 30-60 min; 4 fix runs ran Unit Tests in 3 sec).

On rework-thread-model the new FrameCompletionScope gate model makes JS-thread SubmitCommands legal only between StartRenderingCurrentFrame and FinishRenderingCurrentFrame. The previous test pump called Start and Finish back-to-back with the wait_for AFTER Finish, collapsing the open window to ~zero. JS-thread work then had to win a scheduler race to land its scope before the pump thread re-acquired the mutex and closed the gate. On contended runners JS lost that race repeatedly, accumulating per-poll 16ms+frame() stalls into 30+ minute runtimes.

Restructure: open the gate before the loop, then for each iteration wait_for(16ms) -> Finish -> Start. Gate is open ~99% of the time so JS thread never has to race.

Validation: rework-thread-model-perf branch (BabylonJS#1652 base + this fix only) ran Unit Tests in 3 seconds across all reachable jobs in 4 parallel runs. The previous baseline showed Unit Tests timing out at 30-60 min on multiple runs.

Note: Validation Tests (Playground pixel diff) is a separate flaky issue unrelated to this fix.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x CI"

The DRI3/LIBGL hypothesis was wrong. The 4-30x runtime variance is fixed by 140894b (hold frame gate open during pump wait); LIBGL_ALWAYS_SOFTWARE is a no-op for the actual root cause (FrameCompletionScope scheduler race) and adds confusion.

CI matrix kept trimmed to Linux JSC for now (separate decision).

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Contributor Author

bghgary commented Apr 29, 2026

[Responded by Copilot on behalf of @bghgary]

Correction — earlier root-cause comment was wrong.

The DRI3 / LIBGL_ALWAYS_SOFTWARE hypothesis didn't pan out. Comparing master vs rework-thread-model directly shows the regression is specific to branches carrying #1652, not a runner-pool issue:

Actual root cause

SafeTimespanGuarantor (per-named-update independent gates) was replaced in #1652 with a single shared mutex+CV+counter FrameCompletionScope gate. JS-thread SubmitCommands and similar must acquire a scope, which blocks on m_frameSyncCV while m_frameBlocked == true.

The unit-test pump was changed to:

while (!ready) {
    Start;        // open gate, notify_all
    Finish;       // wait scopes==0, close gate, bgfx::frame()
    wait_for(16ms);
}

The window between Start and Finish is ~zero. A JS thread woken by notify_all has to win a scheduler race against the pump thread re-acquiring the mutex inside Finish in order to land its scope before the gate closes. On contended runners JS loses that race repeatedly → 16 ms + bgfx::frame() per missed scope → tests that should take ~4 s take 30+ min.
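
For clarity, a hedged sketch of the gate implied by the core-dump member names (m_frameBlocked, m_pendingFrameScopes, m_frameSyncCV come from the analysis above; the class shape, mutex name, and method bodies are assumptions):

#include <condition_variable>
#include <mutex>

struct FrameGateSketch
{
    std::mutex m_mutex;
    std::condition_variable m_frameSyncCV;
    bool m_frameBlocked{true};      // closed between Finish and the next Start
    int m_pendingFrameScopes{0};

    // JS thread: IncrementPendingFrameScopes-style scope acquisition.
    void AcquireScope()
    {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_frameSyncCV.wait(lock, [this] { return !m_frameBlocked; });   // blocks while closed
        ++m_pendingFrameScopes;
    }

    // Pump thread: Start opens the gate and wakes waiters...
    void Start()
    {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_frameBlocked = false;
        }
        m_frameSyncCV.notify_all();
    }

    // ...and if Finish runs immediately afterwards, it re-acquires m_mutex and
    // closes the gate again, so a woken JS thread must win the race for m_mutex
    // before Finish does in order to land its scope.
};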

Fix

Invert the pump so the gate stays open during wait_for:

Start;
while (true) {
    wait_for(16ms);   // gate OPEN here, JS thread can submit freely
    if (ready) break;
    Finish;
    Start;
}
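
A slightly fuller, compilable sketch of the same inverted pump (StartRenderingCurrentFrame / FinishRenderingCurrentFrame and the 16 ms wait are from this PR; the function shape and the exitCodeFuture type are assumptions):

#include <chrono>
#include <future>

template <typename DeviceT>
int PumpUntilDone(DeviceT& device, std::future<int>& exitCodeFuture)
{
    device.StartRenderingCurrentFrame();   // open the gate before the loop
    while (true)
    {
        // Gate stays OPEN during the wait, so the JS thread can land its
        // FrameCompletionScope at any time instead of racing the pump thread.
        if (exitCodeFuture.wait_for(std::chrono::milliseconds{16}) == std::future_status::ready)
        {
            break;
        }
        device.FinishRenderingCurrentFrame();   // tick bgfx::frame()
        device.StartRenderingCurrentFrame();    // re-open the gate immediately
    }
    device.FinishRenderingCurrentFrame();       // close the gate at shutdown
    return exitCodeFuture.get();
}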

Validation

Applied the fix on bghgary/unittest-watchdog (commit 140894be). 4 parallel runs × 2 jobs each = 8 jobs:

Run          Job    Total  Unit Tests step
25085490269  gcc    359 s  4 s
25085490269  clang  393 s  4 s
25085504626  gcc    355 s  4 s
25085504626  clang  384 s  4 s
25085506892  gcc    364 s  4 s
25085506892  clang  352 s  4 s
25085508987  gcc    362 s  4 s
25085508987  clang  377 s  4 s

All eight jobs ran the Unit Tests step in 4 s, with total job durations of 352–393 s — back in the master baseline range. The variance is gone.

Open question

Should the fix live in the test pump only (committed), or also propose a change to the FrameCompletionScope gate model in #1652? Real-world consumers (Playground apps) drive Start/Finish from OS event loops with natural delays, so they likely don't hit this race the way the test pump does. Test-only is small and provably fixes CI; gate-model change is the more architectural fix but is in scope of #1652 itself. Will discuss with @bghgary on his return.

Reverted the unrelated LIBGL_ALWAYS_SOFTWARE env-var fix (it was a no-op for this).

bghgary and others added 7 commits April 29, 2026 15:59
The unit-test pump has been calling Start; Finish back-to-back per
loop iteration with the wait_for AFTER Finish, which collapses the
FrameCompletionScope gate's open window to ~zero. With the new
shared-gate model on rework-thread-model, JS-thread SubmitCommands
(and similar) acquire a scope that blocks on the gate's CV; whenever
they wake on notify_all from Start, they have to win a scheduler race
against the pump thread re-entering Finish to land their scope before
the gate closes. On contended Linux CI runners that race is lost
frequently, accumulating 16ms+ per missed acquisition into 30+ minute
test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick /
final Finish" pattern: open the frame immediately after device
creation so the JS thread can submit at any time, then in the pump
loop just tick bgfx (Finish; Start) once per iteration, and rely on
the existing trailing Finish (after nativeCanvas.reset()) to close
the gate at shutdown. Drops the dead Start; Finish priming pair and
the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=
JavaScript.* passes in 2.4s with all 24 sub-tests green.

Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC,
prior version of this fix on bghgary/unittest-watchdog 140894b):
all 8 jobs ran the Unit Tests step in 4 seconds consistently with
total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The unit-test pump has been calling Start; Finish back-to-back per
loop iteration with the wait_for AFTER Finish, which collapses the
FrameCompletionScope gate's open window to ~zero. With the new
shared-gate model on rework-thread-model, JS-thread SubmitCommands
(and similar) acquire a scope that blocks on the gate's CV; whenever
they wake on notify_all from Start, they have to win a scheduler race
against the pump thread re-entering Finish to land their scope before
the gate closes. On contended Linux CI runners that race is lost
frequently, accumulating 16ms+ per missed acquisition into 30+ minute
test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick /
final Finish" pattern: open the frame immediately after device
creation so the JS thread can submit at any time, then in the pump
loop just tick bgfx (Finish; Start) once per iteration, and rely on
the existing trailing Finish (after nativeCanvas.reset()) to close
the gate at shutdown. Drops the dead Start; Finish priming pair and
the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=
JavaScript.* passes in 2.4s with all 24 sub-tests green.

Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC,
prior version of this fix on bghgary/unittest-watchdog 140894b):
all 8 jobs ran the Unit Tests step in 4 seconds consistently with
total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…abylonJS#1646

In BabylonJS#1674's combination of rework-thread-model (BabylonJS#1652) and the
ExternalTexture sync API (BabylonJS#1646), three apps swapped the original
async AddToContextAsync flow for synchronous CreateForJavaScript but
kept the two-frame skeleton from the async version. The result is
that the loader.Dispatch which runs the JS startup() callback now
gets queued in frame 1 even though its observable wait
(startup.get_future().wait()) is in frame 2 — the dispatch can land
in either frame depending on JS-thread scheduling, breaking the
"each Start/Finish pair wraps a phase of work" pattern. On
rework-thread-model the JS thread will block on the closed gate
between frames 1 and 2 anyway, so it functionally works, but the
ordering is wrong.

Move the dispatch into the second frame so the texture creation,
startup() call, and the wait that observes them all run inside the
same frame:

  frame 1: load scripts            (no dispatch)
  frame 2: dispatch[startup] -> wait
  frame 3+: per-asset / render-loop

Three apps had this issue; UnitTests' ExternalTexture tests already
do it correctly. On bare rework-thread-model these apps still use
AddToContextAsync and don't have the bug.

Files:
- Apps/HeadlessScreenshotApp/Win32/App.cpp
- Apps/PrecompiledShaderTest/Source/App.cpp
- Apps/StyleTransferApp/Win32/App.cpp

Local build verification (Win32 RelWithDebInfo):
- HeadlessScreenshotApp: builds cleanly
- StyleTransferApp: builds cleanly
- PrecompiledShaderTest: not configured in this Build/Win32; change
  is structurally identical to HeadlessScreenshotApp which compiled.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Apps/UnitTests/Source/Tests.JavaScript.cpp
Resolves four real conflicts from the integration of master's recent
canvas/blit fixes (BabylonJS#1683) with the threading-model rework (BabylonJS#1652):

* Core/Graphics/InternalInclude/Babylon/Graphics/DeviceContext.h:
  keep rework's AcquireNewViewId() (no encoder param), add master's
  PeekNextViewId() const.

* Core/Graphics/Source/DeviceImpl.h: same — keep rework's
  AcquireNewViewId / IncrementPendingFrameScopes /
  DecrementPendingFrameScopes / SetActiveEncoder / GetActiveEncoder,
  add master's PeekNextViewId() const.

* Plugins/NativeEngine/Source/NativeEngine.cpp::CopyTexture: adopt
  master's BabylonJS#1683 fix structurally — fetch encoder at top of function,
  schedule blit on PeekNextViewId() so it runs after every view used
  so far (matching bgfx's blit-in-numeric-view-order semantics).
  Substitute rework's GetEncoder() member for master's
  GetUpdateToken().GetEncoder() since UpdateToken is a no-op shim on
  rework.

* Polyfills/Canvas/Source/Context.cpp::Flush: keep both — master's
  EnsureFontsLoaded() at top (CPU-only nanovg font registration, no
  GPU/encoder dependency), then rework's optional FrameCompletionScope
  acquisition for callers that hit Flush outside a frame cycle, then
  GetActiveEncoder + discard.

Local validation: Win32 RelWithDebInfo UnitTests --gtest_filter=
JavaScript.* passes in 2.3s, all 24 sub-tests green.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>