[Bug]: UI WebSocket disconnects/reconnects continuously over mildly lossy or packet-reordering links (e.g. WireGuard), despite a healthy server #3054

@davidk747

Description

Before submitting

  • I searched existing issues and did not find a duplicate.
  • I included enough detail to reproduce or investigate the problem.

Area

apps/web

Steps to reproduce

Summary

The browser↔server UI WebSocket (/ws) drops and reconnects every few seconds when the client is on a link with mild packet reordering and/or loss — in our case a WireGuard "road-warrior" tunnel (OPNsense if_wg). The underlying TCP connection stays alive the whole time, and other long-lived apps over the same link (IMAP, plain HTTPS) are unaffected. The WS layer declares "disconnected" on brief stalls that TCP recovers from on its own.

This also intermittently leaves a thread stuck on "awaiting input" with no rendered content — the orphaned-thread behavior in #313 — which appears to be a downstream symptom of this same reconnect churn.

Environment

  • t3code v0.0.27 (server), behind Caddy (reverse proxy, HTTP/2; HTTP/3 also tested).
  • Clients: Android Chrome and desktop Chrome, both reaching the server over a WireGuard tunnel terminating on an OPNsense firewall.

Evidence it is not the server, proxy, or network config

  1. Server/proxy are stable. An authenticated WS client run from the server host — both directly to the t3 process (loopback) and through Caddy — held 70 s with zero drops, while the user's browser was dropping during that same window.

  2. It's specific to the lossy path, measured live. ss -ti on the host, comparing all established connections at one instant:

    Path                                        reord_seen          retransmits
    WireGuard client (the browser)              25 (peaked at 206)  0/336
    non-WG connections (incl. public internet)  0                   0/4 – 0/15
    loopback                                    0                   minimal

    The WG connection also showed cwnd collapsed to ~7 and ~27 ms jitter — but the TCP socket stayed established and transferred 60+ MB. Jittery, not dead.

  3. Not MTU / offload. MSS is correctly clamped (mss:1360, pmtu:1500); NIC hardware offload (TSO/LRO/CRC) is disabled. Oversized-packet black-holing is ruled out.

Conclusion: TCP survives the reordering/loss; it's the WebSocket keepalive/heartbeat that tears the connection down.
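
To repeat the comparison in item 2 without reading ss output by eye, the relevant counters can be pulled out of each info line. The TypeScript below is only a sketch: parseSsInfo and TcpInfoSample are hypothetical names, the sample line is assembled from the figures quoted above rather than copied from a capture, and it assumes ss leaves out retrans and reord_seen when they are zero.

```typescript
// Hypothetical helper, not part of t3code: extracts a few TCP counters
// from one info line of `ss -ti` output.
interface TcpInfoSample {
  reordSeen: number;       // reord_seen:<n>, treated as 0 when the field is absent
  retransCurrent: number;  // first number of retrans:<current>/<total>
  retransTotal: number;    // second number of retrans:<current>/<total>
  cwnd: number | null;     // cwnd:<n>, null when the field is absent
}

function parseSsInfo(line: string): TcpInfoSample {
  // Returns the first capture group as a number, or null if the field is missing.
  const num = (re: RegExp): number | null => {
    const m = re.exec(line);
    return m ? Number(m[1]) : null;
  };
  // \b keeps this from matching inside other fields such as bytes_retrans:<n>.
  const retrans = /\bretrans:(\d+)\/(\d+)/.exec(line);
  return {
    reordSeen: num(/\breord_seen:(\d+)/) ?? 0,
    retransCurrent: retrans ? Number(retrans[1]) : 0,
    retransTotal: retrans ? Number(retrans[2]) : 0,
    cwnd: num(/\bcwnd:(\d+)/),
  };
}

// Illustrative line built from the values reported in this issue.
const sample = parseSsInfo("mss:1360 pmtu:1500 cwnd:7 retrans:0/336 reord_seen:25");
console.log(sample.reordSeen, sample.retransTotal, sample.cwnd); // 25 336 7
```

Running this over every established connection at one instant gives the per-path view shown in the table.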

Suspected root cause

The WS keepalive is too aggressive for imperfect links — a single brief stall (reordering or a retransmit) trips a disconnect instead of being ridden out. Apps without an aggressive heartbeat over the identical tunnel are fine.

Requested change

  1. Make the WS keepalive tolerant of brief stalls/reordering before tearing down — a longer ping timeout, several missed beats before declaring dead, and ideally a configurable timeout for users on high-latency/VPN/mobile links.
  2. Ensure a reconnect re-attaches cleanly without orphaning a pending thread; this is the root cause behind #313 ("Transient reconnect leaves thread stuck in error and can orphan pending user turns").
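
As a rough illustration of item 1, the "several missed beats before declaring dead" idea can be sketched as a small counter that is independent of any socket or timer. This is not how t3code implements its keepalive (the issue does not show that code); createHeartbeatMonitor, pingIntervalMs and maxMissedPongs are hypothetical names, and the values used are examples only.

```typescript
// Hypothetical sketch: tolerate a configurable number of unanswered pings
// before treating the connection as dead.
type HeartbeatVerdict = "healthy" | "degraded" | "dead";

interface HeartbeatOptions {
  pingIntervalMs: number;  // how often the caller sends a ping
  maxMissedPongs: number;  // consecutive unanswered pings tolerated
}

interface HeartbeatMonitor {
  onPingSent(): HeartbeatVerdict;
  onPong(): void;
  toleratedStallMs(): number;
}

function createHeartbeatMonitor(options: HeartbeatOptions): HeartbeatMonitor {
  let awaitingPong = false;
  let missed = 0;
  return {
    // Call when sending each ping. If the previous ping is still unanswered,
    // it counts as one missed beat.
    onPingSent() {
      if (awaitingPong) missed += 1;
      awaitingPong = true;
      if (missed >= options.maxMissedPongs) return "dead";
      return missed > 0 ? "degraded" : "healthy";
    },
    // Call when any pong arrives; a single reply clears the missed count.
    onPong() {
      awaitingPong = false;
      missed = 0;
    },
    // Approximate stall length, measured from the first unanswered ping,
    // that is needed before onPingSent() reports "dead".
    toleratedStallMs() {
      return options.pingIntervalMs * options.maxMissedPongs;
    },
  };
}

const monitor = createHeartbeatMonitor({ pingIntervalMs: 5000, maxMissedPongs: 3 });
console.log(monitor.onPingSent()); // healthy
console.log(monitor.onPingSent()); // degraded (one pong missed, connection kept)
monitor.onPong();
console.log(monitor.onPingSent()); // healthy again
```

With a scheme like this, one delayed pong caused by reordering or a retransmit only moves the state to "degraded"; the caller would close and reconnect only on "dead", and both option values could be exposed as user settings for high-latency, VPN, or mobile links.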

Related

Repro

Use t3code over any link with mild loss + packet reordering. On a Linux box you can emulate it on the client (or a gateway):

sudo tc qdisc add dev <iface> root netem delay 30ms 20ms reorder 5% loss 0.2%

Open a thread and watch the UI cycle "disconnected → reconnected" every few seconds while the page is otherwise reachable. Remove with sudo tc qdisc del dev <iface> root.

Expected behavior

The UI WebSocket should ride out brief stalls and packet reordering that TCP recovers from on its own, and a reconnect should re-attach cleanly without orphaning a pending thread. The two concrete changes are listed under "Requested change" above.

Actual behavior

The UI WebSocket (/ws) drops and reconnects every few seconds over the WireGuard link while the underlying TCP connection stays established, and a thread is intermittently left stuck on "awaiting input" with no rendered content. The full summary, environment, evidence, and suspected root cause are given under "Steps to reproduce" above.

Impact

Minor bug or occasional failure

Version or commit

No response

Environment

No response

Logs or stack traces

Screenshots, recordings, or supporting files

No response

Workaround

No response

Metadata

    Labels

    bug: Something is broken or behaving incorrectly.
    needs-triage: Issue needs maintainer review and initial categorization.
