
test_kqueue_pipe_peer_close_uaf hangs at 120s watchdog on Linux backend (master) #170

Description

@arr2036

Repro

test_kqueue_pipe_peer_close_uaf hangs at the 120s watchdog on every Linux CI variant (release-gcc, release-clang, release-musl-gcc, debug-asan, debug-tsan). Reproduces on master (cfe5dd0) without any of the Windows branch changes.

The test (test/kqueue.c:635) is straightforward:

4 worker threads, each looping for ~500ms:

  int p[2], kq;
  struct kevent kev;
  pipe(p);
  kq = kqueue();
  EV_SET(&kev, p[1], EVFILT_WRITE, EV_ADD, 0, 0, NULL);
  kevent(kq, &kev, 1, NULL, 0, NULL);
  close(p[0]); close(p[1]); close(kq);
  /* repeat */

After joining the workers it calls libkqueue_drain_pending_close() to wait for the monitoring thread to retire every closed kq before the test returns.

CI sample run: https://github.com/arr2036/libkqueue/actions/runs/25345514171 (release-gcc job; same shape on the asan / tsan jobs).

What I think is happening

Two things stack here. The first is now fixed in arr2036/libkqueue@windows-iter-debug and explains roughly half the strands; the second is still open.

(a) Monitoring-thread tid race — fixed in the branch. A previous refactor (b388b1c, "linux/platform: cleanup acquires kq_mtx, drop CANCEL_LOCKED machinery") symmetrised the cleanup handler's lock/unlock for tsan-lockset cleanliness, which moved the kq_mtx unlock ahead of monitoring_thread_cleanup clearing monitoring_tid. That opened a window where a racing kqueue() could read a still-set monitoring_tid and F_SETOWN_EX its close-detect signal to a TID about to die. The kernel discards thread-directed RT signals when the target thread exits, so the close-detect signal for that kq was lost and the kq was stranded in kq_list; libkqueue_drain_pending_close then spun until it hit its 1M-iteration cap.

A follow-up (67aea58, "linux/platform: drop pthread_detach") papered over a related symptom but didn't address the underlying unlock-too-early move. V1 (pre-b388b1c) held kq_mtx continuously through pthread_detach, making linux_libkqueue_free's tid-read atomic with respect to the detach. The branch restores V1 and re-introduces thread_exit_state so the cleanup handler still knows whether to acquire kq_mtx itself or inherit it from a cancelled context. The tools/tsan.supp suppression (race:monitoring_thread_cleanup, added in 6465a87 + reaffirmed in 3f831b3) covers the lockset asymmetry that motivated the original refactor.

(b) Residual hang — still open. Even with V1 restored, every Linux job in the CI run above still hits the 120s watchdog at this same test. The watchdog's stack-dump tool can't attach under Ubuntu CI's default kernel.yama.ptrace_scope = 1 (eu-stack: dwfl_thread_getframes tid N: Operation not permitted), so I don't have a backtrace yet. It's unclear whether the hang is in libkqueue_drain_pending_close (so still some kq stranded, different cause from (a)) or somewhere else under the worker join.

Asks

  • Reproduce locally with kernel.yama.ptrace_scope = 0 so gdb can attach and see which threads are blocked where. The hang is reliable on every Ubuntu 24.04 runner, so it should reproduce on a developer workstation under the same kernel/glibc.
  • Once the blocked thread is known: is the residual strand a different code path the V1 restoration didn't catch, or is it user-side (e.g. the close(kq) somehow racing the monitoring thread's EVFILT_PROC handling for a child process the test doesn't actually spawn)?

Branch with the V1 restoration

arr2036/libkqueue@windows-iter-debug, commit arr2036@4bad3d6

(Lives on a Windows-port branch but the linux/platform.c change is self-contained.)
