
Recover from stale temp file left by crashed writer#11

Merged
softins merged 1 commit into softins:master from mcfnord:stale-lock-recovery on May 13, 2026

@mcfnord mcfnord commented Apr 15, 2026

Problem

If the PHP process writing new cache data is killed mid-flight (e.g. a connection reset at the PHP-FPM socket level), register_shutdown_function('cleanup') does not run. This leaves a zero-byte .tmp lock file behind.

All subsequent requests for that endpoint then enter the acquisition loop, fail to open the .tmp file exclusively, and retry every 200 ms until the 20-second timeout fires, at which point the script returns a non-JSON die() response. A caller with a shorter timeout (e.g. 10s) gives up first and sees a 499/503 instead. The endpoint stays broken until the stale file is manually removed.

Fix

After a failed fopen(..., 'x'), check whether the .tmp file is older than 30 seconds. A normal successful fetch completes in well under 2 seconds, so a 30-second threshold only fires on genuinely abandoned locks. When detected, log the event, delete the file, and continue — the next loop iteration acquires the lock cleanly and performs a fresh fetch.
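The loop described above could look roughly like the following. This is a minimal sketch, not the merged code: the function name `acquireLock()`, the constant names, and the return convention are all hypothetical, and only the 30-second threshold, the 200 ms sleep, the 20-second timeout, and the log text come from this PR.

```php
<?php
// Hypothetical names throughout; the real patch may be structured differently.
const STALE_AFTER_SECONDS  = 30;      // stale threshold from the PR description
const RETRY_SLEEP_MICROS   = 200000;  // 200 ms between attempts
const HARD_TIMEOUT_SECONDS = 20;      // existing backstop timeout

/**
 * Try to create $tmpPath exclusively.
 * Returns the open file handle, or null if the timeout expires first.
 */
function acquireLock(string $tmpPath, int $timeout = HARD_TIMEOUT_SECONDS)
{
    $deadline = microtime(true) + $timeout;

    while (microtime(true) < $deadline) {
        // Mode 'x' fails if the file already exists, so it doubles as a lock.
        $handle = @fopen($tmpPath, 'x');
        if ($handle !== false) {
            return $handle;
        }

        // fopen failed: another writer holds the lock, or a crashed one left it.
        clearstatcache(true, $tmpPath);
        $mtime = @filemtime($tmpPath);
        if ($mtime !== false && time() - $mtime > STALE_AFTER_SECONDS) {
            error_log("Stale lock file detected: $tmpPath");
            @unlink($tmpPath);
            continue; // retry immediately; the next fopen should succeed
        }

        usleep(RETRY_SLEEP_MICROS);
    }

    return null; // caller decides how to report the timeout
}
```

The stale check sits after the failed `fopen`, so the common path (lock acquired on the first try) pays no extra cost. The `clearstatcache()` call matters because PHP caches `filemtime()` results within a request.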

Test plan

  • Confirm fix self-heals an existing stale .tmp file on next request (verified in production: log emitted "Stale lock file detected", fresh data returned in ~1250ms)
  • Confirm normal concurrent requests are unaffected (stale check only runs after fopen fails, and only when file is >30s old)
  • Confirm the 20-second hard timeout remains as a backstop
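For anyone wanting to reproduce the stale-file condition outside production, one option is to back-date a lock file with `touch()`. The sketch below is illustrative only: `isStaleLock()` and the temp path are hypothetical and stand in for the check described in the Fix section.

```php
<?php
// Hypothetical helper and path, for illustration only.
function isStaleLock(string $path, int $maxAgeSeconds = 30): bool
{
    clearstatcache(true, $path);
    $mtime = @filemtime($path);
    return $mtime !== false && (time() - $mtime) > $maxAgeSeconds;
}

$tmp = sys_get_temp_dir() . '/stale-lock-demo-' . getmypid() . '.tmp';

// Simulate a crashed writer: zero-byte file last modified 60 seconds ago.
touch($tmp, time() - 60);
var_dump(isStaleLock($tmp)); // bool(true)

// Simulate a live writer: same file, modified just now.
touch($tmp);
var_dump(isStaleLock($tmp)); // bool(false)

unlink($tmp);
```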

🤖 Generated with Claude Code

If the PHP process writing new cache data crashes (e.g. connection reset
by peer), register_shutdown_function may not run, leaving a zero-byte
.tmp lock file behind. All subsequent requests then loop waiting for a
cache update that never arrives, timing out after 20s (or sooner if the
caller disconnects).

Detect this by checking whether the .tmp file is older than 30 seconds
(well past the ~1.5s a normal fetch takes) and removing it so the next
waiter can take the lock and complete the fetch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@softins (Owner) left a comment

Looks fine - many thanks!

@softins softins merged commit 0cff370 into softins:master May 13, 2026