[DO-NOT-MERGE] Diagnostic: JS-progress watchdog for UnitTests hangs #1681

Draft
bghgary wants to merge 86 commits into BabylonJS:master from bghgary:unittest-watchdog

Conversation

@bghgary
Contributor

bghgary commented Apr 27, 2026

Status: DO-NOT-MERGE — diagnostic only

[Created by Copilot on behalf of @bghgary]

Verdict

Heartbeat-fed watchdog confirms: slow progress on Ubuntu llvmpipe JSC, not a deadlock.

Run 25026066984: 27✅ 1❌. The lone ❌ is Ubuntu_GCC_JSC failing at Playground rendering with a pixel diff (First pixel off at 108872 ... Pixel difference: 17308 pixels) — a pre-existing render flake that aborted the job before UnitTests ran; it is unrelated to the watchdog.

Ubuntu_Clang_JSC (the previously failing job) passed JavaScript.All in 141.5 s with 3 PASSED, 4 SKIPPED (ExternalTexture). No watchdog abort. A local WSL run took 209 s, with [watchdog-heartbeat] pump_frames=... lines visible during the gaps in mocha output.

Why the previous "60 → 600 s default" wasn't enough

The earlier run on this branch hit a 600-s console-output gap in Materials/Empty-ShaderMaterial and dumped a core. Post-mortem showed gtest in wait_for(16ms), all bgfx threads idle, and the JS thread blocked in IncrementPendingFrameScopes with DeviceImpl{m_frameBlocked=true, m_pendingFrameScopes=0, m_rendering=false, m_firstFrameStarted=true} — the same quiescent inter-frame snapshot you would see in either the slow-progress or the deadlock case.

A single core dump can't distinguish the two: the gtest pump loop spends almost all its wall time in wait_for(16ms), so any random-time capture lands there in either case. We needed a tiebreaker — does the render thread actually keep ticking?

Mechanism (this PR)

Two-source watchdog:

  1. Console output activity (mocha lines on stdout). Existing channel.
  2. Render-thread frame progress. g_pumpFrameCount (atomic), incremented after each Start/Finish cycle in the gtest pump loop.

Watchdog thread polls the atomic every 1 s. When it advances, the "last activity" timestamp resets — so a slow-but-progressing test never trips. Every 30 s, the watchdog emits [watchdog-heartbeat] pump_frames=N so we can see progress in the log even during shader-compile silence. If both sources stall for the timeout window, abort with a message that names both.

This means a real deadlock — render thread stuck in bgfx::frame() or FinishRenderingCurrentFrame() — would now trip cleanly, and slow Materials shader compiles do not.
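
A minimal sketch of that dual-source logic, showing its shape (g_pumpFrameCount and the [watchdog-heartbeat] line come from this PR; the timeout plumbing, console-activity hook, and abort path are illustrative assumptions, not the exact Tests.JavaScript.cpp code):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Incremented by the render-thread pump loop after each Start/Finish cycle.
std::atomic<uint64_t> g_pumpFrameCount{0};

void WatchdogThread(std::chrono::seconds timeout)
{
    using clock = std::chrono::steady_clock;
    auto lastActivity = clock::now();   // in the real code, console output also resets this
    auto lastHeartbeat = lastActivity;
    uint64_t lastFrames = g_pumpFrameCount.load();

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::seconds{1});
        const uint64_t frames = g_pumpFrameCount.load();
        const auto now = clock::now();

        if (frames != lastFrames)        // render thread made frame progress
        {
            lastFrames = frames;
            lastActivity = now;          // slow-but-progressing tests never trip
        }

        if (now - lastHeartbeat >= std::chrono::seconds{30})
        {
            std::fprintf(stderr, "[watchdog-heartbeat] pump_frames=%llu\n",
                         static_cast<unsigned long long>(frames));
            lastHeartbeat = now;
        }

        if (now - lastActivity >= timeout)
        {
            std::fprintf(stderr, "Watchdog: no console output and no pump-frame progress "
                                 "within the timeout window; aborting\n");
            std::abort();                // core dump for post-mortem
        }
    }
}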

Default restored to 60 s since false positives are eliminated.

Where it changes

  • Apps/UnitTests/Source/Tests.JavaScript.cpp — atomic + watchdog dual-source logic + pump-loop increment.

Follow-ups (other PRs / repos, not this one)

  • Slow llvmpipe shader-compile chains in Babylon Materials/* are real and consume ~20–30% of the JSC job budget. See #1674 ([DO-NOT-MERGE] Partner integration test: #1652 + #1646) for partner-side discussion. Possible mitigations: skip the slowest cases on llvmpipe, bump the JSC job timeout, or investigate bgfx-side llvmpipe compile cost.

bghgary and others added 30 commits March 24, 2026 22:05
- Add Tests.ExternalTexture.Render.cpp: end-to-end test that renders a
  texture array through a ShaderMaterial to an external render target,
  verifying each slice (red, green, blue) via pixel readback.
- Add tests.externalTexture.render.ts: JS test with sampler2DArray shader.
- Add RenderDoc.h/cpp to UnitTests for optional GPU capture support.
- Add Utils helpers: CreateTestTextureArrayWithData, CreateRenderTargetTexture,
  ReadBackRenderTarget, DestroyRenderTargetTexture (D3D11, Metal, stubs for
  D3D12/OpenGL).
- Fix ExternalTexture_Shared.h: pass m_impl->NumLayers() instead of
  hardcoded 1 in Attach(), preserving texture array metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove layerIndex parameter from Impl::Update declaration to match
the updated signature in ExternalTexture_Shared.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate HeadlessScreenshotApp, StyleTransferApp, and PrecompiledShaderTest
from AddToContextAsync (promise-based) to CreateForJavaScript (synchronous).
This removes the extra frame pump and promise callbacks that were previously
required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llers

- Move RenderDoc.h/cpp to D3D11-only CMake block with HAS_RENDERDOC define
- Guard RenderDoc calls with HAS_RENDERDOC instead of WIN32
- Update StyleTransferApp and PrecompiledShaderTest to use CreateForJavaScript

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On OpenGL, TextureT is unsigned int (not a pointer), so reinterpret_cast
fails. Add NativeHandleToUintPtr helper using if constexpr to handle both
pointer types (D3D11/Metal/D3D12) and integer types (OpenGL).
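
A sketch of what such a helper can look like (TextureT here stands for the per-API native texture handle type; the exact name and signature in the PR may differ):

#include <cstdint>
#include <type_traits>

template <typename TextureT>
uintptr_t NativeHandleToUintPtr(TextureT handle)
{
    if constexpr (std::is_pointer_v<TextureT>)
    {
        // D3D11/Metal/D3D12: the native handle is a pointer.
        return reinterpret_cast<uintptr_t>(handle);
    }
    else
    {
        // OpenGL: the native handle is an unsigned integer (GLuint).
        return static_cast<uintptr_t>(handle);
    }
}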

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RenderDoc.h/cpp now accept void* device instead of ID3D11Device*
- Move RenderDoc to WIN32 block (not D3D11-only) since it works with any API
- Fix OpenGL build: use NativeHandleToUintPtr helper for TextureT cast
- Add Linux support (dlopen librenderdoc.so)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DestroyTestTexture after FinishRenderingCurrentFrame so bgfx::frame()
processes the texture creation command before the native resource is released.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensure the JS startup dispatch completes before calling
deviceUpdate.Finish() and FinishRenderingCurrentFrame(). This
guarantees that bgfx::frame() processes the CreateForJavaScript
texture creation command, making the texture available for subsequent
render frames. The old async API had an implicit sync point
(addToContext.wait) that the new sync API lost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eation

- Rename CreateForJavaScriptWithTextureArray to CreateForJavaScript and
  use arraySize=1 since texture array rendering is covered by
  RenderTextureArray. The old test crashed on CI (STATUS_BREAKPOINT in
  bgfx when creating texture arrays via encoder on WARP).
- Revert two-step create+override approach back to single createTexture2D
  call with _external parameter (overrideInternal from JS thread doesn't
  work since the D3D11 resource isn't created until bgfx::frame).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateForJavaScript already exists in Tests.ExternalTexture.cpp,
causing a linker duplicate symbol error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test)

The D3D11-specific CreateForJavaScript test crashed on CI due to
bgfx assertions when calling createTexture2D with _external on
the encoder thread. The cross-platform CreateForJavaScript test
in Tests.ExternalTexture.cpp already covers this functionality.
The texture array rendering is covered by RenderTextureArray.

Also revert app startup ordering to Finish->Wait (matching the
pattern used by HeadlessScreenshotApp on master).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx callback's fatal() handler was silently calling debugBreak()
on DebugCheck assertions with no output, making CI crashes impossible
to diagnose. Now logs the file, line, error code and message to stderr
before breaking, so the assertion details appear in CI logs.
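
For illustration, the extra logging could be done by a helper along these lines, assuming the current bgfx::CallbackI::fatal parameters (file path, line, Fatal code, message); the actual callback class in this repo is not reproduced here:

#include <bgfx/bgfx.h>
#include <cstdint>
#include <cstdio>

// Called from the callback's fatal() override before it falls through to debugBreak().
void LogBgfxFatal(const char* filePath, uint16_t line, bgfx::Fatal::Enum code, const char* message)
{
    std::fprintf(stderr, "[bgfx fatal] %s(%u): code=%d %s\n",
                 filePath != nullptr ? filePath : "<unknown>",
                 static_cast<unsigned>(line),
                 static_cast<int>(code),
                 message != nullptr ? message : "");
    std::fflush(stderr);   // make sure the details reach CI logs even if the process crashes next
}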

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DISM/d3dconfig step to CI to enable D3D debug layer, which will
provide detailed D3D11 validation messages for the CreateShaderResourceView
E_INVALIDARG failure. Kept the _external createTexture2D path (reverted
the AfterRenderScheduler approach) so we can see the actual D3D debug
output that explains the SRV mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx _external texture path triggers E_INVALIDARG in
CreateShaderResourceView on CI's WARP D3D11 driver. The overrideInternal
alternative doesn't support full array textures (hardcodes ArraySize=1).
Since the _external path works on real GPUs, skip the render test on CI
via BABYLON_NATIVE_SKIP_RENDER_TESTS and keep the direct _external path.

Also adds D3D debug layer enablement to CI for future diagnostics, and
logs bgfx fatal errors to stderr before crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from _external parameter (crashes on WARP) to create+overrideInternal
two-step approach. The overrideInternal path is compatible with WARP but
sets ArraySize=1 in the SRV, so the RenderTextureArray test (which needs
full array access) is skipped on CI. The render test works on real GPUs
where the _external path succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The overrideInternal call fires on AfterRenderScheduler after the first
bgfx::frame(). An additional frame pump ensures the native texture
backing is applied before the scene render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove deleted Tests.ExternalTexture.D3D11.cpp from Install/Test/CMakeLists.txt
- Add extra frame pump after CreateForJavaScript in HeadlessScreenshotApp
  and StyleTransferApp so overrideInternal has time to apply the native
  texture backing before the first render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompletionScope instead of deferred member release.
…cquired FrameCompletionScope in SubmitCommands.
First CI run with the watchdog confirmed it works as designed: it caught
a 60-second console-output gap inside Materials/Empty-ShaderMaterial and
produced a clean core dump.

Post-mortem analysis of the dump on Ubuntu Clang JSC showed:

- Render thread (gtest) was in the normal wait_for(16ms) inter-frame
  window of the pump loop, not stuck in Finish or dispatcher tick.
- All bgfx renderer / worker threads were idle in cond_wait.
- JS thread was blocked in IncrementPendingFrameScopes because the gate
  was momentarily closed (m_frameBlocked=true, m_pendingFrameScopes=0,
  m_rendering=false), which is the expected state between Finish and
  the next Start. The next Start call (~16ms later) would have unblocked
  it.
- Re-running the same Tests.JavaScript binary locally on WSL Ubuntu
  24.04 + Mesa llvmpipe completed all 3 tests (JavaScript_All,
  ShaderCache.*, UniformPadding.*) successfully in 231 s.

Conclusion: the 60s watchdog was firing on slow-but-progressing shader
compile inside Babylon's Materials test suite on software GL, not on a
real deadlock. Bump the default to 600s so it only fires on genuinely
stuck progress (true hangs) while preserving headroom inside the 30-min
job timeout.

The watchdog itself is still valuable as a guard against real hangs and
as a way to convert opaque job-timeout cancellations into actionable
core dumps; only the threshold needed adjusting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary and others added 6 commits April 27, 2026 16:58
Previous watchdog only watched JS console output. On Ubuntu llvmpipe, mocha goes silent for many seconds during a single test that compiles many shader variants synchronously inside bgfx::frame(), producing false-positive aborts at 60s and even 600s.

Add a render-thread frame counter (g_pumpFrameCount) incremented each pump iteration. The watchdog thread samples it every 1s; if it advanced, it resets the activity timestamp and emits a [watchdog-heartbeat] line every 30s for diagnosability.

Effect:
- Slow-but-progressing shader compile: pump_frames keeps advancing, the watchdog never trips, and the test completes (209 s locally on WSL llvmpipe; the slowest CI runners may take roughly 10x longer).
- Real deadlock (render thread or scope-acquire stuck): pump_frames freezes, no console activity either, watchdog trips with an actionable signal.

Restored default threshold from 600s back to 60s — false positives are now eliminated by the heartbeat, so we want fast detection of a genuine stall.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…slowness

Past CI runs on this branch show massive variance in `Materials/Empty
ShaderMaterial`: 4.7 s on a fast Linux runner (run 25026066984), > 600 s on
a slow runner (run 25025043245). Same code, same test. Heartbeat watchdog
proved render-thread frames advance during the silence -- not a deadlock --
but doesn't tell us *what within each frame* eats the wall time on slow
runners.

This commit adds per-phase accumulators to the pump loop:

  - g_pumpStartNs   : time inside StartRenderingCurrentFrame()
  - g_pumpFinishNs  : time inside FinishRenderingCurrentFrame()
  - g_pumpWaitNs    : time blocked in exitCodeFuture.wait_for(16ms)

Heartbeat now emits `start_ms / finish_ms / wait_ms` alongside `pump_frames`
every 30 s so a slow run produces a phase-time breakdown directly in CI logs.

If on a slow runner finish_ms dominates start_ms+wait_ms, the cost is in
bgfx::frame() / submit / GL upload. If start_ms dominates, it's RAF / scene
init. If wait_ms dominates relative to frames, JS-side work is blocking.
That data tells us whether to chase shader compile, scene init, llvmpipe
driver work, or a soft livelock -- without guessing.
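
Roughly what the accumulators look like (a sketch; the timing helper and the device / exitCodeFuture names in the commented usage are assumptions, not the committed code):

#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<uint64_t> g_pumpStartNs{0};
std::atomic<uint64_t> g_pumpFinishNs{0};
std::atomic<uint64_t> g_pumpWaitNs{0};

// Run a callable and return how long it took, in nanoseconds.
template <typename F>
uint64_t TimedNs(F&& f)
{
    const auto begin = std::chrono::steady_clock::now();
    f();
    return static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - begin).count());
}

// Inside the pump loop (device and exitCodeFuture are hypothetical names):
//   g_pumpStartNs  += TimedNs([&] { device.StartRenderingCurrentFrame(); });
//   g_pumpFinishNs += TimedNs([&] { device.FinishRenderingCurrentFrame(); });
//   g_pumpWaitNs   += TimedNs([&] { exitCodeFuture.wait_for(std::chrono::milliseconds{16}); });
// The heartbeat then reports each accumulator's delta (in ms) per 30 s window.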

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
We're characterizing runner-variance-driven Materials slowness in Ubuntu_*_JSC. Every other matrix entry adds Actions runtime / cost without contributing data. Restore from master before merging.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Linux JSC UnitTests has been failing on a non-trivial fraction of GitHub-hosted ubuntu-latest runners. Per-phase pump-loop instrumentation on a slow runner showed:

  pump_frames advancing at the full ~62 Hz cadence
  start_ms  =    16 ms / 30000 ms (0.05% of wall time)
  finish_ms =    36 ms / 30000 ms (0.12% of wall time)
  wait_ms   = 29943 ms / 30000 ms (99.8% of wall time)

i.e., the C++ render thread is essentially idle. The bottleneck is on the JS thread (Babylon.js scene.executeWhenReady polling, gated by JSC executing JS code). With the same code and workflow on different ubuntu-latest runner instances, the same 'Empty ShaderMaterial should compile' test took 4.7 s on a fast runner vs 620.2 s on a slow runner — 133x runner variance. This is a runner-side problem (likely JSC JIT/GC under noisy-neighbor VM conditions), not our code.

30 min was a tight budget for the full suite even on average runners; on slow runners the suite cannot complete in time. Bump to 60 min so legitimate slow runs finish instead of the job timing out.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Slow-runner characterization complete; restore the full CI matrix so we validate the timeout bump against real conditions on all platforms.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Contributor Author

bghgary commented Apr 28, 2026

Update: CI completion root cause proven

Recommendation: bump Linux job timeout 30→60 min (pushed in 89a04b41). Restored full CI matrix in ee1a6662; validation run in flight.

Data

Per-phase pump-loop instrumentation pushed earlier in this branch produced the following on a slow ubuntu-latest runner (cancelled at ~16 min into UnitTests):

[watchdog-heartbeat] pump_frames=+1859 / 30 s   (≈ max framerate at 16 ms cadence)
[watchdog-heartbeat] start_ms   =+16   / 30000  (0.05 % of wall time)
[watchdog-heartbeat] finish_ms  =+36   / 30000  (0.12 % of wall time)
[watchdog-heartbeat] wait_ms    =+29943 / 30000 (99.8 % of wall time)

Per-test result on the same slow runner:

[log]   Materials
[log]     ✅ Empty ShaderMaterial should compile (620201ms)

That same test ran in 4,658 ms on a fast runner the day before — a 133× variance for the same test, same code, same workflow, different ubuntu-latest VM instance.

What this proves

  • The native render thread is essentially idle during slow runs. bgfx, llvmpipe, and the GL driver are not the bottleneck (start+finish < 0.2 % of wall time).
  • The bottleneck is on the JS thread — Babylon.js's scene.executeWhenReady polling, gated by JSC executing JS code.
  • Cause is almost certainly noisy-neighbor on cheap GitHub-hosted VMs (JSC JIT/GC under contention). Not deadlock, not threading, not bgfx, not llvmpipe shader compile, not our code.
  • Earlier "watchdog firings" on the 60s and 600s defaults were the watchdog correctly catching slow runs that would otherwise have hit the 30-min job timeout. The current heartbeat-fed watchdog (frame-counter tiebreaker) does not fire on slow-but-progressing runs — verified this session: 16 min of slow progress, no false-positive abort.

Fix in this PR

  1. build-linux.yml timeout-minutes: 30 → 60. Slow runners need more headroom; 60 min covers worst observed.
  2. Heartbeat-fed watchdog kept as the real-hang safety net (60 s default, frame-counter tiebreaker, opt-out via BN_WATCHDOG_TIMEOUT_S=0).
  3. Per-phase pump instrumentation kept as a permanent diagnostic — it's how we'll diagnose the next CI regression in seconds instead of days.

Pixel-diff failures

Out of scope for this PR — separate issue, deferred per @bghgary.

[Responded by Copilot on behalf of @bghgary]

bghgary and others added 13 commits April 28, 2026 15:01
The 60-min timeout bump from the previous commit was insufficient: Linux JSC jobs still timed out at 60 min on slow runners. Per-phase pump instrumentation showed pump_frames advancing at full rate but the JS thread spending 53+ min in scene.executeWhenReady polling on Materials/Empty ShaderMaterial — vs 4.7 s on a fast runner.

Job logs revealed the runner-side cause:

  libEGL warning: DRI3 error: Could not get DRI3 device

  libEGL warning: Ensure your X server supports DRI3 to get accelerated rendering

Some ubuntu-latest VMs cannot initialize DRI3 and silently fall back to a slow software OpenGL path. Different runners hit different fallbacks (swrast vs llvmpipe), producing 100x+ runtime variance for identical code.

Force LIBGL_ALWAYS_SOFTWARE=1 + GALLIUM_DRIVER=llvmpipe so every runner uses the same vectorized software path. Eliminates DRI3 dependency and runner variance.

Also temporarily trim CI matrix to Linux JSC only while validating; restore before merging.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause of the 4-30x Linux JSC CI runtime variance was proven on the rework-thread-model-perf branch (4 baseline runs hit 30-60 min; 4 fix runs ran Unit Tests in 3 sec).

On rework-thread-model the new FrameCompletionScope gate model makes JS-thread SubmitCommands legal only between StartRenderingCurrentFrame and FinishRenderingCurrentFrame. The previous test pump called Start and Finish back-to-back with the wait_for AFTER Finish, collapsing the open window to ~zero. JS-thread work then had to win a scheduler race to land its scope before the pump thread re-acquired the mutex and closed the gate. On contended runners JS lost that race repeatedly, accumulating per-poll 16ms+frame() stalls into 30+ minute runtimes.

Restructure: open the gate before the loop, then for each iteration wait_for(16ms) -> Finish -> Start. Gate is open ~99% of the time so JS thread never has to race.

Validation: rework-thread-model-perf branch (BabylonJS#1652 base + this fix only) ran Unit Tests in 3 seconds across all reachable jobs in 4 parallel runs. The previous baseline showed Unit Tests timing out at 30-60 min on multiple runs.

Note: Validation Tests (Playground pixel diff) is a separate flaky issue unrelated to this fix.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x CI"

The DRI3/LIBGL hypothesis was wrong. The 4-30x runtime variance is fixed by 140894b (hold frame gate open during pump wait); LIBGL_ALWAYS_SOFTWARE is a no-op for the actual root cause (FrameCompletionScope scheduler race) and adds confusion.

CI matrix kept trimmed to Linux JSC for now (separate decision).

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Contributor Author

bghgary commented Apr 29, 2026

[Responded by Copilot on behalf of @bghgary]

Correction — earlier root-cause comment was wrong.

The DRI3 / LIBGL_ALWAYS_SOFTWARE hypothesis didn't pan out. Comparing master vs rework-thread-model directly shows the regression is specific to branches carrying #1652, not a runner-pool issue:

Actual root cause

SafeTimespanGuarantor (per-named-update independent gates) was replaced in #1652 with a single shared mutex+CV+counter FrameCompletionScope gate. JS-thread SubmitCommands and similar must acquire a scope, which blocks on m_frameSyncCV while m_frameBlocked == true.

The unit-test pump was changed to:

while (!ready) {
    Start;        // open gate, notify_all
    Finish;       // wait scopes==0, close gate, bgfx::frame()
    wait_for(16ms);
}

The window between Start and Finish is ~zero. A JS thread woken by notify_all has to win a scheduler race against the pump thread re-acquiring the mutex inside Finish in order to land its scope before the gate closes. On contended runners JS loses that race repeatedly → 16 ms + bgfx::frame() per missed scope → tests that should take ~4 s take 30+ min.
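
For clarity, a hedged sketch of the gate implied by the core-dump member names (m_frameBlocked, m_pendingFrameScopes, m_frameSyncCV come from the analysis above; the class shape, mutex name, and method bodies are assumptions):

#include <condition_variable>
#include <mutex>

struct FrameGateSketch
{
    std::mutex m_mutex;
    std::condition_variable m_frameSyncCV;
    bool m_frameBlocked{true};      // closed between Finish and the next Start
    int m_pendingFrameScopes{0};

    // JS thread: IncrementPendingFrameScopes-style scope acquisition.
    void AcquireScope()
    {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_frameSyncCV.wait(lock, [this] { return !m_frameBlocked; });   // blocks while closed
        ++m_pendingFrameScopes;
    }

    // Pump thread: Start opens the gate and wakes waiters...
    void Start()
    {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_frameBlocked = false;
        }
        m_frameSyncCV.notify_all();
    }

    // ...and if Finish runs immediately afterwards, it re-acquires m_mutex and
    // closes the gate again, so a woken JS thread must win the race for m_mutex
    // before Finish does in order to land its scope.
};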

Fix

Invert the pump so the gate stays open during wait_for:

Start;
while (true) {
    wait_for(16ms);   // gate OPEN here, JS thread can submit freely
    if (ready) break;
    Finish;
    Start;
}
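
A slightly fuller, compilable sketch of the same inverted pump (StartRenderingCurrentFrame / FinishRenderingCurrentFrame and the 16 ms wait are from this PR; the function shape and the exitCodeFuture type are assumptions):

#include <chrono>
#include <future>

template <typename DeviceT>
int PumpUntilDone(DeviceT& device, std::future<int>& exitCodeFuture)
{
    device.StartRenderingCurrentFrame();   // open the gate before the loop
    while (true)
    {
        // Gate stays OPEN during the wait, so the JS thread can land its
        // FrameCompletionScope at any time instead of racing the pump thread.
        if (exitCodeFuture.wait_for(std::chrono::milliseconds{16}) == std::future_status::ready)
        {
            break;
        }
        device.FinishRenderingCurrentFrame();   // tick bgfx::frame()
        device.StartRenderingCurrentFrame();    // re-open the gate immediately
    }
    device.FinishRenderingCurrentFrame();       // close the gate at shutdown
    return exitCodeFuture.get();
}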

Validation

Applied the fix on bghgary/unittest-watchdog (commit 140894be). 4 parallel runs × 2 jobs each = 8 jobs:

Run          Job    Total  Unit Tests step
25085490269  gcc    359 s  4 s
25085490269  clang  393 s  4 s
25085504626  gcc    355 s  4 s
25085504626  clang  384 s  4 s
25085506892  gcc    364 s  4 s
25085506892  clang  352 s  4 s
25085508987  gcc    362 s  4 s
25085508987  clang  377 s  4 s

All eight jobs ran the Unit Tests step in 4 s, with total job durations of 352–393 s — back in the master baseline range. The variance is gone.

Open question

Should the fix live in the test pump only (committed), or also propose a change to the FrameCompletionScope gate model in #1652? Real-world consumers (Playground apps) drive Start/Finish from OS event loops with natural delays, so they likely don't hit this race the way the test pump does. Test-only is small and provably fixes CI; gate-model change is the more architectural fix but is in scope of #1652 itself. Will discuss with @bghgary on his return.

Reverted the unrelated LIBGL_ALWAYS_SOFTWARE env-var fix (it was a no-op for this).

bghgary and others added 7 commits April 29, 2026 15:59
The unit-test pump has been calling Start; Finish back-to-back per
loop iteration with the wait_for AFTER Finish, which collapses the
FrameCompletionScope gate's open window to ~zero. With the new
shared-gate model on rework-thread-model, JS-thread SubmitCommands
(and similar) acquire a scope that blocks on the gate's CV; whenever
they wake on notify_all from Start, they have to win a scheduler race
against the pump thread re-entering Finish to land their scope before
the gate closes. On contended Linux CI runners that race is lost
frequently, accumulating 16ms+ per missed acquisition into 30+ minute
test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick /
final Finish" pattern: open the frame immediately after device
creation so the JS thread can submit at any time, then in the pump
loop just tick bgfx (Finish; Start) once per iteration, and rely on
the existing trailing Finish (after nativeCanvas.reset()) to close
the gate at shutdown. Drops the dead Start; Finish priming pair and
the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=
JavaScript.* passes in 2.4s with all 24 sub-tests green.

Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC,
prior version of this fix on bghgary/unittest-watchdog 140894b):
all 8 jobs ran the Unit Tests step in 4 seconds consistently with
total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The unit-test pump has been calling Start; Finish back-to-back per
loop iteration with the wait_for AFTER Finish, which collapses the
FrameCompletionScope gate's open window to ~zero. With the new
shared-gate model on rework-thread-model, JS-thread SubmitCommands
(and similar) acquire a scope that blocks on the gate's CV; whenever
they wake on notify_all from Start, they have to win a scheduler race
against the pump thread re-entering Finish to land their scope before
the gate closes. On contended Linux CI runners that race is lost
frequently, accumulating 16ms+ per missed acquisition into 30+ minute
test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick /
final Finish" pattern: open the frame immediately after device
creation so the JS thread can submit at any time, then in the pump
loop just tick bgfx (Finish; Start) once per iteration, and rely on
the existing trailing Finish (after nativeCanvas.reset()) to close
the gate at shutdown. Drops the dead Start; Finish priming pair and
the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=
JavaScript.* passes in 2.4s with all 24 sub-tests green.

Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC,
prior version of this fix on bghgary/unittest-watchdog 140894b):
all 8 jobs ran the Unit Tests step in 4 seconds consistently with
total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…abylonJS#1646

In BabylonJS#1674's combination of rework-thread-model (BabylonJS#1652) and the
ExternalTexture sync API (BabylonJS#1646), three apps swapped the original
async AddToContextAsync flow for synchronous CreateForJavaScript but
kept the two-frame skeleton from the async version. The result is
that the loader.Dispatch which runs the JS startup() callback now
gets queued in frame 1 even though its observable wait
(startup.get_future().wait()) is in frame 2 — the dispatch can land
in either frame depending on JS-thread scheduling, breaking the
"each Start/Finish pair wraps a phase of work" pattern. On
rework-thread-model the JS thread will block on the closed gate
between frames 1 and 2 anyway, so it functionally works, but the
ordering is wrong.

Move the dispatch into the second frame so the texture creation,
startup() call, and the wait that observes them all run inside the
same frame:

  frame 1: load scripts            (no dispatch)
  frame 2: dispatch[startup] -> wait
  frame 3+: per-asset / render-loop

Three apps had this issue; UnitTests' ExternalTexture tests already
do it correctly. On bare rework-thread-model these apps still use
AddToContextAsync and don't have the bug.

Files:
- Apps/HeadlessScreenshotApp/Win32/App.cpp
- Apps/PrecompiledShaderTest/Source/App.cpp
- Apps/StyleTransferApp/Win32/App.cpp

Local build verification (Win32 RelWithDebInfo):
- HeadlessScreenshotApp: builds cleanly
- StyleTransferApp: builds cleanly
- PrecompiledShaderTest: not configured in this Build/Win32; change
  is structurally identical to HeadlessScreenshotApp which compiled.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Apps/UnitTests/Source/Tests.JavaScript.cpp
Resolves four real conflicts from the integration of master's recent
canvas/blit fixes (BabylonJS#1683) with the threading-model rework (BabylonJS#1652):

* Core/Graphics/InternalInclude/Babylon/Graphics/DeviceContext.h:
  keep rework's AcquireNewViewId() (no encoder param), add master's
  PeekNextViewId() const.

* Core/Graphics/Source/DeviceImpl.h: same — keep rework's
  AcquireNewViewId / IncrementPendingFrameScopes /
  DecrementPendingFrameScopes / SetActiveEncoder / GetActiveEncoder,
  add master's PeekNextViewId() const.

* Plugins/NativeEngine/Source/NativeEngine.cpp::CopyTexture: adopt
  master's BabylonJS#1683 fix structurally — fetch encoder at top of function,
  schedule blit on PeekNextViewId() so it runs after every view used
  so far (matching bgfx's blit-in-numeric-view-order semantics).
  Substitute rework's GetEncoder() member for master's
  GetUpdateToken().GetEncoder() since UpdateToken is a no-op shim on
  rework.

* Polyfills/Canvas/Source/Context.cpp::Flush: keep both — master's
  EnsureFontsLoaded() at top (CPU-only nanovg font registration, no
  GPU/encoder dependency), then rework's optional FrameCompletionScope
  acquisition for callers that hit Flush outside a frame cycle, then
  GetActiveEncoder + discard.

Local validation: Win32 RelWithDebInfo UnitTests --gtest_filter=
JavaScript.* passes in 2.3s, all 24 sub-tests green.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>