Skip to content

fix(relay): classify script quarantine by failure type#1368

Closed
maybeknott wants to merge 3 commits into
therealaleph:mainfrom
maybeknott:fix/script-failure-quarantine
Closed

fix(relay): classify script quarantine by failure type#1368
maybeknott wants to merge 3 commits into
therealaleph:mainfrom
maybeknott:fix/script-failure-quarantine

Conversation

@maybeknott
Copy link
Copy Markdown

A single cooldown duration is too coarse for Apps Script deployment failures. Quota exhaustion and account-level authorization failures recover on a much longer cadence than transient Google edge or Apps Script backend failures. Treating both classes the same either probes exhausted deployments too aggressively or removes transiently unhealthy deployments for longer than necessary.

Relay failure handling now classifies script failures into two explicit quarantine classes. HTTP 429, HTTP 403, and response bodies that match quota or service-invocation limit text are treated as hard quota/account failures and quarantined for 24 hours. Google or Apps Script transient 5xx responses are treated as temporary relay failures and use the existing short cooldown window.

The transient class is deliberately narrow. Generic upstream 5xx bodies such as a destination-origin bad gateway do not quarantine a script ID by themselves; the body must look like a Google, Apps Script, GFE, backend, service-unavailable, temporary, or timeout failure. This avoids punishing healthy deployments for ordinary origin-side errors that Apps Script relayed correctly.

The same classifier is used across the direct relay path, h1 fallback path, tunnel single-operation path, and tunnel batch path. Quota-like errors returned inside the Apps Script JSON envelope still force the hard quarantine path even when the outer HTTP status is 200.

The English and Persian guides now describe auto-quarantine as two failure classes instead of a single ten-minute blacklist. Unit coverage verifies hard quota/account classification, transient Google-edge classification, ordinary upstream 5xx pass-through, and the quarantine durations for both classes.

Apps Script quota is consumed per relay invocation, but a plain round-robin selector has no memory of how heavily this client has used each deployment inside the recent quota window. When multiple script IDs are configured, continuing to select an already saturated deployment while another configured deployment is still locally underused wastes available capacity and increases the chance of quota-related relay stalls.

DomainFronter now keeps a per-script local ledger of selection timestamps in a rolling 24-hour window. Before choosing a script ID, the selector prunes expired observations and prefers non-blacklisted deployments whose local call count remains below the free-tier request budget. Both the single-request selector and the parallel fan-out selector use the same ledger so Apps Script batches and relay fan-out draw from the same local capacity model.

The ledger records selections at dispatch time. That deliberately accounts for concurrent fan-out attempts and for requests that may still complete server-side after the Rust future is dropped. The ledger is a local steering signal rather than an authoritative Google quota reading: if every non-blacklisted deployment is locally saturated, the selector still returns a deployment instead of creating a client-side outage. This preserves connectivity for paid Workspace quotas, shared deployments whose external usage is invisible to this process, and cases where the local estimate is conservative.

Selection remains decoupled from the existing failure blacklist. Blacklisted deployments are still skipped first; the rolling quota ledger only orders otherwise healthy deployments by locally observed capacity. If all deployments are blacklisted, the existing earliest-cooldown recovery path is preserved and the selected deployment is recorded in the ledger.

The guide now describes the local rolling 24-hour ledger in the Full Mode deployment-scaling section, including the fact that it steers away from deployments this client has already driven near the free-tier request budget. Unit coverage exercises saturated deployment skipping, expired observation pruning, all-saturated connectivity fallback, and parallel selection preferring unsaturated deployments.
@github-actions github-actions Bot added the type: fix fix: PR — auto-applied by release-drafter label May 23, 2026
A single cooldown duration is too coarse for Apps Script deployment failures. Quota exhaustion and account-level authorization failures recover on a much longer cadence than transient Google edge or Apps Script backend failures. Treating both classes the same either probes exhausted deployments too aggressively or removes transiently unhealthy deployments for longer than necessary.

Relay failure handling now classifies script failures into two explicit quarantine classes. HTTP 429, HTTP 403, and response bodies that match quota or service-invocation limit text are treated as hard quota/account failures and quarantined for 24 hours. Google or Apps Script transient 5xx responses are treated as temporary relay failures and use the existing short cooldown window.

The transient class is deliberately narrow. Generic upstream 5xx bodies such as a destination-origin bad gateway do not quarantine a script ID by themselves; the body must look like a Google, Apps Script, GFE, backend, service-unavailable, temporary, or timeout failure. This avoids punishing healthy deployments for ordinary origin-side errors that Apps Script relayed correctly.

The same classifier is used across the direct relay path, h1 fallback path, tunnel single-operation path, and tunnel batch path. Quota-like errors returned inside the Apps Script JSON envelope still force the hard quarantine path even when the outer HTTP status is 200.

The English and Persian guides now describe auto-quarantine as two failure classes instead of a single ten-minute blacklist. Unit coverage verifies hard quota/account classification, transient Google-edge classification, ordinary upstream 5xx pass-through, and the quarantine durations for both classes.
@maybeknott maybeknott force-pushed the fix/script-failure-quarantine branch from 74d3071 to 48cd891 Compare May 24, 2026 00:32
Add a read-only per-deployment health snapshot over the existing relay state so operators can inspect how deployment selection is behaving without changing the scheduler itself.

The snapshot reports masked script IDs, locally observed rolling quota usage, the configured local quota threshold, saturation state, active cooldown seconds, cooldown reason, and timeout strike count. Cooldown reasons are tracked alongside the existing blacklist timestamps and are pruned whenever expired blacklist entries are removed.

Surface the snapshot in the desktop UI as a collapsible Script health table, clear stale rows when the proxy stops or exits, and document that these values are local client observations rather than authoritative Google-side quota counters.

Add focused unit coverage for quota saturation, cooldown reason exposure, timeout strike visibility, and compact duration formatting. The relay routing, quarantine durations, and selection behavior remain unchanged.
@maybeknott
Copy link
Copy Markdown
Author

Closing this standalone quota slice because the quota steering work has been folded into #1388, which now carries the relay batching, quota steering, failure quarantine, and script-health UI together as one coherent review unit.

@maybeknott maybeknott closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type: fix fix: PR — auto-applied by release-drafter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant