Repro
test_kqueue_pipe_peer_close_uaf hangs at the 120s watchdog on every Linux CI variant (release-gcc, release-clang, release-musl-gcc, debug-asan, debug-tsan). Reproduces on master (cfe5dd0) without any of the Windows branch changes.
The test (test/kqueue.c:635) is straightforward:
4 worker threads × ~500ms each:
pipe(p);
kq = kqueue();
EV_ADD on p[1] for EVFILT_WRITE;
close(p[0]); close(p[1]); close(kq);
// repeat
After joining the workers it calls libkqueue_drain_pending_close() to wait for the monitoring thread to retire every closed kq before the test returns.
CI sample run: https://github.com/arr2036/libkqueue/actions/runs/25345514171 (release-gcc job; same shape on the asan / tsan jobs).
What I think is happening
Two things stack here. The first is now fixed in arr2036/libkqueue@windows-iter-debug and explains roughly half the strands; the second is still open.
(a) Monitoring-thread tid race — fixed in the branch. A previous refactor (b388b1c, "linux/platform: cleanup acquires kq_mtx, drop CANCEL_LOCKED machinery") symmetrised the cleanup handler's lock/unlock for tsan-lockset cleanliness, which moved the kq_mtx unlock point ahead of monitoring_thread_cleanup clearing monitoring_tid. That opened a window where a racing kqueue() could read a still-set monitoring_tid and F_SETOWN_EX its close-detect signal to a TID about to die. The kernel discards thread-directed RT signals when the target exits, so the close-detect signal for that kq was lost and the kq stranded in kq_list; libkqueue_drain_pending_close then spun out its 1M-iteration cap.
A follow-up (67aea58, "linux/platform: drop pthread_detach") papered over a related symptom but didn't address the underlying unlock-too-early move. V1 (pre-b388b1c) held kq_mtx continuously through pthread_detach, making linux_libkqueue_free's tid-read atomic with respect to the detach. The branch restores V1 and re-introduces thread_exit_state so the cleanup handler still knows whether to acquire kq_mtx itself or inherit it from a cancelled context. The tools/tsan.supp suppression (race:monitoring_thread_cleanup, added in 6465a87 + reaffirmed in 3f831b3) covers the lockset asymmetry that motivated the original refactor.
(b) Residual hang — still open. Even with V1 restored, every Linux job in the CI run above still hits the 120s watchdog at this same test. The watchdog's stack-dump tool can't attach under Ubuntu CI's default kernel.yama.ptrace_scope = 1 (eu-stack: dwfl_thread_getframes tid N: Operation not permitted), so I don't have a backtrace yet. It's unclear whether the hang is in libkqueue_drain_pending_close (so still some kq stranded, different cause from (a)) or somewhere else under the worker join.
Asks
- Reproduce locally with ptrace_scope = 0 so gdb can attach and see which threads are blocked where. The hang is reliable on every Ubuntu 24.04 runner; it should reproduce on a developer workstation under the same kernel/glibc.
- Once the blocked thread is known: is the residual strand a different code path the V1 restoration didn't catch, or is it user-side (e.g. the close(kq) somehow racing the monitoring thread's EVFILT_PROC handling for a child process the test doesn't actually spawn)?
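Concretely, something like the following; the sysctl commands assume Ubuntu's Yama LSM, and the test binary path/invocation is a guess:

```shell
# Check the current Yama ptrace policy (1 = restricted, the Ubuntu default)
cat /proc/sys/kernel/yama/ptrace_scope

# Relax it for this boot so gdb/eu-stack can attach to non-child processes
sudo sysctl -w kernel.yama.ptrace_scope=0

# Run the test, then attach at the hang and dump all thread stacks
./test/libkqueue-test test_kqueue_pipe_peer_close_uaf &   # hypothetical invocation
gdb -p "$!" -batch -ex 'thread apply all bt'
```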
Branch with the V1 restoration
arr2036/libkqueue@windows-iter-debug commit 4bad3d6: arr2036@4bad3d6
(Lives on a Windows-port branch but the linux/platform.c change is self-contained.)