Skip to content

Fix #189: process worker fork-safety on macOS Tahoe#207

Open
brentrager wants to merge 3 commits into
Maxteabag:mainfrom
brentrager:fix/macos-tahoe-fork-segfault
Open

Fix #189: process worker fork-safety on macOS Tahoe#207
brentrager wants to merge 3 commits into
Maxteabag:mainfrom
brentrager:fix/macos-tahoe-fork-segfault

Conversation

@brentrager
Copy link
Copy Markdown

@brentrager brentrager commented May 20, 2026

Thanks again for sqlit — it's the cleanest of the SQL TUIs I've tried.

Fixes #189. macOS Tahoe (26.4+) widened the set of system libraries that abort when initialized after fork(), which has been quietly breaking quite a few Python projects (celery, gunicorn, psycopg2 all have open issues from the last few months — links below). For sqlit, the symptom is that the process worker either hangs with Worker connection closed. or segfaults the child when expanding a folder or running a query.

This PR has been split per your request (#207 review) — the psycopg v3 driver swap and its supporting adapter/test changes now live in #222.

Commits

  1. Drop the spawn → fork fallback on darwin; redirect worker stderr. When multiprocessing.spawn fails with bad value(s) in fds_to_keep, the current code falls back to fork(). On Tahoe that's no longer safe — CoreFoundation, Security.framework, libsystem_trace's os_log, and libsystem_info's getaddrinfo all now abort if they were loaded by the parent before fork. Letting spawn fail and surfacing it to the caller lets the in-process fallback take over instead. The worker also now redirects its stdout/stderr to $SQLIT_WORKER_LOG (defaults to <tmpdir>/sqlit-worker.log) and enables faulthandler, so future C-level faults aren't invisible behind a closed pipe — that's what let me find the segfault site in the first place.

  2. Pre-spawn the worker in cli.py before App.__init__. This is the long-term fix from the issue addendum. multiprocessing.spawn snapshots the parent's open FDs at spawn time, and Textual's App.__init__ registers signal-handler pipes that aren't inheritable — that's the bad value(s) in fds_to_keep ValueError. The previous lazy creation in idle_scheduler.py:_warm was racing this. Spawning before SSMSTUI(...) is constructed is the latest moment at which the FD set is still clean. The pre-spawned client is plumbed in via a new process_worker_client kwarg on SSMSTUI; ProcessWorkerLifecycleMixin already returns the cached client first, so no other changes are needed. If pre-spawn fails for any reason, the lazy path still runs (and on darwin, that path now correctly falls back to in-process execution per commit 1).

  3. Unit tests for the new code paths. New tests/unit/test_process_worker_fallback.py and tests/unit/test_cli_prewarm.py cover the darwin re-raise, the worker-log env override, the prewarm gating on runtime.process_worker / runtime.mock, prewarm exception-swallowing, and the SSMSTUI kwarg plumbing.

Result on macOS Tahoe 26.4.1 / Python 3.14

State Before After commit 1 only After both
Worker crash on tree expand / query Hard segfault, Worker connection closed. In-process fallback, popup every query Worker runs, no popup
Query latency (25-row PG table over WAN+TLS) n/a (crash) ~10s (UI thread blocked) ~0.5s
Query cancellation Not available Not available (in-process fallback) Available

Tests

uv run pytest tests/unit — 837 passed, 1 skipped, including the 9 new tests in this PR.

Related upstream regressions on Tahoe (for context — this isn't sqlit-specific)

Tradeoffs to flag

  • Commit 1 costs query cancellation on darwin when spawn fails (the in-process fallback can't cancel mid-query). Commit 2 makes that case rare in practice.
  • Commit 2 always spawns the worker at startup when process_worker=True (the default) — there's no longer an idle-only path on the first launch. Cost: one extra Python subprocess at TUI startup. Benefit: no race, no popup, worker is hot.

Happy to iterate on either commit if you'd prefer a different approach.

@Maxteabag
Copy link
Copy Markdown
Owner

Hey! Thanks so much for sharing the fix.

It seems like a lot of different thing in here are bundled together - of completely different nature. As you suggested yourself, I think it would be a good idea to split this into two PRs.

@brentrager brentrager force-pushed the fix/macos-tahoe-fork-segfault branch from 65b1833 to 89b3469 Compare May 23, 2026 13:16
@brentrager brentrager changed the title Fix #189: process worker fork-safety on macOS Tahoe (+ psycopg3) Fix #189: process worker fork-safety on macOS Tahoe May 23, 2026
@brentrager
Copy link
Copy Markdown
Author

Done — split into two PRs as you suggested:

I force-pushed brentrager:fix/macos-tahoe-fork-segfault to remove the psycopg3 commits from this PR's history. Let me know if you'd like further reshaping on either one.

…r stderr

Two changes for the macOS Tahoe segfault reported in issue Maxteabag#189:

1. process_worker_client.py: drop the spawn -> fork fallback when running
   on darwin. macOS Tahoe (26+) widened the set of system libraries that
   abort if initialized after fork() — CoreFoundation, Security.framework,
   libsystem_trace, libsystem_info — so any library the parent has loaded
   (psycopg, getaddrinfo, setproctitle, NSCharacterSet) can crash the
   forked child. Letting spawn fail and surfacing it to the caller lets
   the in-process fallback in query_execution take over instead.

   Cost: query cancellation isn't available when spawn fails. Gain: no
   hard crashes ("Worker connection closed.") on macOS.

2. process_worker.py: redirect the worker's stdout/stderr to a log file
   and enable faulthandler at entry. The worker can SIGPIPE on libpq
   notice writes when its inherited FDs point to a Textual-managed pipe,
   and any future C-level fault on the worker side is otherwise invisible
   to the parent (the pipe just dies). The log path honors
   $SQLIT_WORKER_LOG and defaults to <tmpdir>/sqlit-worker.log.

Refs the celery (Tahoe + setproctitle, celery#9894) and gunicorn
(Tahoe + NSCharacterSet, gunicorn#3526) reports for the broader fork-
safety regression on Tahoe.
The first two commits made the macOS Tahoe segfault non-fatal by
disabling the fork fallback, but every spawn after Textual booted
still failed with `bad value(s) in fds_to_keep` — leaving the user
with a "Process worker unavailable; falling back" popup and the
slower in-process executor on every query.

Root cause: `multiprocessing.spawn` snapshots the parent's open file
descriptors at spawn time. Once Textual's `App.__init__` has
registered its signal-handler pipes and event-loop FDs, several of
those are not inheritable, and spawn aborts before the child even
runs. The previous lazy creation in the idle scheduler always hit
this race.

Fix: spawn the worker in `cli.py` *before* `SSMSTUI(...)` is
constructed. That's the latest moment at which the parent's FD set
is still clean. The new `_prewarm_process_worker(runtime)` helper
returns a live `ProcessWorkerClient`, or `None` if the worker is
disabled / mocked / spawn raised — in which case we fall through to
the lazy path inside the UI (with the in-process fallback on
darwin).

The pre-spawned client is passed into `SSMSTUI` via a new
`process_worker_client` kwarg and stashed as
`self._process_worker_client`. `ProcessWorkerLifecycleMixin` already
returns the cached client first, so no other changes are needed.

Verified on macOS Tahoe 26.4.1 / Python 3.14 / PostgreSQL over TLS:
the popup no longer appears and queries run in the worker process.
Covers the load-bearing behaviors introduced by the prior commits:

- _maybe_fallback_start re-raises on darwin (commit 1)
- _maybe_fallback_start still re-raises non-fds_to_keep errors verbatim
- _open_worker_log honors $SQLIT_WORKER_LOG and survives an unwritable path
- _prewarm_process_worker respects runtime.process_worker, runtime.mock,
  and swallows spawn exceptions so startup never crashes (commit 3)
- SSMSTUI(process_worker_client=client) stashes the client on
  self._process_worker_client, which is what ProcessWorkerLifecycleMixin
  consults first
@brentrager brentrager force-pushed the fix/macos-tahoe-fork-segfault branch from 89b3469 to 6016067 Compare May 25, 2026 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broken pipe/worker connection closed errors in TUI only, inconsistently

2 participants