Skip to content

Fix GH-8739: OPcache restart corrupts SHM in threaded SAPIs (ZTS)#22281

Open
nacholibre wants to merge 1 commit into
php:PHP-8.4from
nacholibre:zts-opcache-restart-safety
Open

Fix GH-8739: OPcache restart corrupts SHM in threaded SAPIs (ZTS)#22281
nacholibre wants to merge 1 commit into
php:PHP-8.4from
nacholibre:zts-opcache-restart-safety

Conversation

@nacholibre

@nacholibre nacholibre commented Jun 11, 2026

Copy link
Copy Markdown

Summary

In threaded SAPIs (FrankenPHP, mod_php with a threaded MPM), any OPcache restart — out-of-memory, hash overflow, or opcache_reset() — wipes shared memory while other threads are still executing code from it, crashing the whole server with zend_mm_heap corrupted (SIGABRT) or SIGSEGV. This is the ZTS manifestation of GH-8739 (GH-18517 and GH-14471 were closed as duplicates of it; also reported as php/frankenphp#2170, php/frankenphp#1737).

Root cause

accel_is_inactive() gates restart execution on "no in-flight SHM readers". On POSIX this is implemented with fcntl() record locks: readers take F_RDLCK in accel_activate_add(), and the restarting thread probes with F_GETLK.

POSIX record locks are per-process — a process never conflicts with its own locks. When all requests are threads of one process, F_GETLK reports no conflict, accel_is_inactive() always returns true, and the restart memsets ZCSG(hash) and resets the interned-strings buffer under hundreds of live readers.

This is 100 % reproducible: warm cache + ~120 concurrent requests + one opcache_reset() call crashes FrankenPHP within a second (3–4 threads typically report zend_mm_heap corrupted simultaneously). The OOM path crashes identically once accumulated wasted memory exhausts the SHM — for us this killed a production server roughly daily, on every PHP deployment cycle.

The fix

Windows already solved this exact problem: because Windows SAPIs are threaded, accel_is_inactive() there uses an atomic SHM counter (ZCSG(mem_usage)) instead of fcntl. This PR gives POSIX ZTS builds the equivalent:

  • ZCSG(ts_active_requests) — atomic counter of in-flight requests (ZTS non-Windows only).
  • Requests register in RINIT after the restart-pending block (so a thread never blocks its own restart check) and only while ZCG(accelerator_enabled) is true — so once a restart is scheduled, new requests stop registering and the counter drains within the duration of the longest in-flight request, even under saturation.
  • Deregistration in accel_post_deactivate(), paired via a per-thread flag (ZCG(ts_reader_registered)), so the counter stays balanced across early-return paths.
  • accel_is_inactive() returns false while the counter is non-zero.
  • After registering, the request re-checks ZCSG(restart_in_progress) and falls back to uncached execution if it lost the race against an already-running restart.

NTS builds are completely unaffected (no code change compiles in), so php-fpm/prefork keep the fcntl mechanism and its kernel-side crash cleanup. There is no per-request lock added — readers stay lock-free; the cost is one atomic add and one atomic sub per request.

Why not the EBR approach of #21778

I tested #21778 against the reproducer below first: it still crashes identically, because its deferral branch is only reachable as the else of if (accel_is_inactive()) — and under ZTS that check always returns true, so the original unsafe path runs before the new machinery is ever consulted (with opcache.log_verbosity_level=4, Restart Scheduled! and Restarting! appear in the same second with ~120 requests in flight, and the patch's "Deferring opcache restart" message never fires). The epoch/slot machinery also isn't needed for this problem: OPcache already prevents new readers during a pending restart via accelerator_enabled = false, so a plain drain counter is sufficient.

Validation

Real-world workload: FrankenPHP v1.12.2 static (PHP 8.4.20 + this patch), classic mode, 256 threads, Symfony 6.4 production mirror, ~120 concurrent requests replaying live URLs, ~3300 cached scripts:

scenario unpatched with this fix
opcache_reset() under load crash on 1st call (zend_mm_heap corrupted ×3, SIGABRT/SIGSEGV) 5/5 survived; log shows Restart Scheduled! → 1–8 s reader drain → Restarting!
wasted-memory exhaustion (mass invalidation cycles with validate_timestamps=1) crash when free SHM hits 0 clean deferred restart (oom_restarts: 1), cache rebuilds, server healthy

make test is unaffected (the counter path requires ZTS and concurrency; single-threaded .phpt tests cannot exercise the race — the existing behavior of all opcache tests is unchanged). I'm happy to retest revisions against the FrankenPHP reproducer harness — turnaround is ~15 minutes.

Known residual

A nanoseconds-wide TOCTOU remains between a thread snapshotting accelerator_enabled and its registration becoming visible to a concurrently executing restart. The restart_in_progress re-check narrows it, and the same window exists in the battle-tested fcntl protocol used by process-based SAPIs — this patch brings ZTS to parity with that behavior. Closing it completely would require taking the shared-alloc lock in the reader path, which was already rejected in #14803 for performance reasons.

Before executing a scheduled restart (OOM, hash overflow or
opcache_reset()), accel_is_inactive() checks that no request is still
reading shared memory. On POSIX this check is implemented with fcntl()
record locks: readers take an F_RDLCK in accel_activate_add() and the
restarting thread probes for it with F_GETLK.

POSIX record locks are per-process: a process never conflicts with its
own locks. In threaded SAPIs (FrankenPHP, mod_php with a threaded MPM),
all in-flight requests are threads of one process, so F_GETLK never
reports a conflict and accel_is_inactive() always returns true. The
restart then runs zend_accel_hash_clean() and
accel_interned_strings_restore_state() while other threads execute
op_arrays and hold interned string pointers in the wiped memory,
corrupting the heap ("zend_mm_heap corrupted", SIGSEGV/SIGABRT).

Fix it the way ZEND_WIN32 already solved the same problem for threaded
SAPIs on Windows: track in-flight requests with an atomic counter in
SHM. Requests register in RINIT (only while the accelerator is enabled,
so registrations stop as soon as a restart is pending and the counter
drains within the duration of the longest in-flight request) and
deregister in accel_post_deactivate(). accel_is_inactive() refuses to
restart while the counter is non-zero.

The counter is only compiled in for ZTS non-Windows builds; NTS
process-based SAPIs (php-fpm, mod_php prefork) keep using fcntl()
locks, whose kernel-side cleanup on process death they rely on.

Verified on a production Symfony workload under FrankenPHP (256
threads, ~120 concurrent requests): repeated opcache_reset() calls and
wasted-memory-exhaustion restarts, both previously crashing within
seconds, now complete safely after draining readers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant