fix(relay): classify script quarantine by failure type #1368
Closed
maybeknott wants to merge 3 commits into
Apps Script quota is consumed per relay invocation, but a plain round-robin selector has no memory of how heavily this client has used each deployment inside the recent quota window. When multiple script IDs are configured, continuing to select an already saturated deployment while another configured deployment is still locally underused wastes available capacity and increases the chance of quota-related relay stalls.
DomainFronter now keeps a per-script local ledger of selection timestamps in a rolling 24-hour window. Before choosing a script ID, the selector prunes expired observations and prefers non-blacklisted deployments whose local call count remains below the free-tier request budget. Both the single-request selector and the parallel fan-out selector use the same ledger so Apps Script batches and relay fan-out draw from the same local capacity model.
The ledger records selections at dispatch time. That deliberately accounts for concurrent fan-out attempts and for requests that may still complete server-side after the Rust future is dropped. The ledger is a local steering signal rather than an authoritative Google quota reading: if every non-blacklisted deployment is locally saturated, the selector still returns a deployment instead of creating a client-side outage. This preserves connectivity for paid Workspace quotas, shared deployments whose external usage is invisible to this process, and cases where the local estimate is conservative.
Selection remains decoupled from the existing failure blacklist. Blacklisted deployments are still skipped first; the rolling quota ledger only orders otherwise healthy deployments by locally observed capacity. If all deployments are blacklisted, the existing earliest-cooldown recovery path is preserved and the selected deployment is recorded in the ledger.
The guide now describes the local rolling 24-hour ledger in the Full Mode deployment-scaling section, including the fact that it steers away from deployments this client has already driven near the free-tier request budget. Unit coverage exercises saturated deployment skipping, expired observation pruning, all-saturated connectivity fallback, and parallel selection preferring unsaturated deployments.
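The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `QuotaLedger`, `select`, `usage`, the rotation cursor, and the `budget` parameter are hypothetical names, and the caller is assumed to have already filtered out blacklisted script IDs.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Rolling window covered by the ledger (24 hours, as described above).
pub const WINDOW: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical local ledger: script ID -> dispatch timestamps, oldest first.
pub struct QuotaLedger {
    /// Assumed per-deployment request budget for one window.
    budget: usize,
    /// Rotation cursor so healthy deployments are still visited in turn.
    cursor: usize,
    calls: HashMap<String, VecDeque<Instant>>,
}

impl QuotaLedger {
    pub fn new(budget: usize) -> Self {
        Self { budget, cursor: 0, calls: HashMap::new() }
    }

    /// Drop observations that have aged out of the rolling window.
    fn prune(&mut self, now: Instant) {
        for stamps in self.calls.values_mut() {
            while let Some(&oldest) = stamps.front() {
                if now.saturating_duration_since(oldest) < WINDOW {
                    break;
                }
                stamps.pop_front();
            }
        }
    }

    /// Locally observed calls for `id` that are still recorded in the ledger.
    pub fn usage(&self, id: &str) -> usize {
        self.calls.get(id).map_or(0, |stamps| stamps.len())
    }

    /// Pick from `healthy` (assumed to be already blacklist-filtered),
    /// preferring IDs under budget, and record the pick at dispatch time.
    pub fn select(&mut self, healthy: &[&str], now: Instant) -> Option<String> {
        if healthy.is_empty() {
            return None;
        }
        self.prune(now);
        let n = healthy.len();
        let start = self.cursor % n;
        self.cursor = self.cursor.wrapping_add(1);
        let pick = (0..n)
            .map(|i| healthy[(start + i) % n])
            .find(|id| self.usage(id) < self.budget)
            // Every candidate is locally saturated: still return one
            // rather than causing a client-side outage.
            .unwrap_or(healthy[start]);
        self.calls.entry(pick.to_string()).or_default().push_back(now);
        Some(pick.to_string())
    }
}
```

Recording at selection time, rather than on response, means a request whose future is later dropped still counts against the local estimate, which matches the dispatch-time accounting described above.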
A single cooldown duration is too coarse for Apps Script deployment failures. Quota exhaustion and account-level authorization failures recover on a much longer cadence than transient Google edge or Apps Script backend failures. Treating both classes the same either probes exhausted deployments too aggressively or removes transiently unhealthy deployments for longer than necessary.
Relay failure handling now classifies script failures into two explicit quarantine classes. HTTP 429, HTTP 403, and response bodies that match quota or service-invocation limit text are treated as hard quota/account failures and quarantined for 24 hours. Google or Apps Script transient 5xx responses are treated as temporary relay failures and use the existing short cooldown window.
The transient class is deliberately narrow. Generic upstream 5xx bodies such as a destination-origin bad gateway do not quarantine a script ID by themselves; the body must look like a Google, Apps Script, GFE, backend, service-unavailable, temporary, or timeout failure. This avoids punishing healthy deployments for ordinary origin-side errors that Apps Script relayed correctly.
The same classifier is used across the direct relay path, h1 fallback path, tunnel single-operation path, and tunnel batch path. Quota-like errors returned inside the Apps Script JSON envelope still force the hard quarantine path even when the outer HTTP status is 200.
The English and Persian guides now describe auto-quarantine as two failure classes instead of a single ten-minute blacklist. Unit coverage verifies hard quota/account classification, transient Google-edge classification, ordinary upstream 5xx pass-through, and the quarantine durations for both classes.
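A rough sketch of the two-class classifier described above. The names `Quarantine` and `classify`, the marker strings, and the ten-minute transient cooldown are assumptions made for illustration; the PR text does not show the actual matching text or the exact short-cooldown value. The `body` argument is assumed to be the Apps Script error text (including text taken from the JSON envelope), not arbitrary relayed page content.

```rust
use std::time::Duration;

/// The two quarantine classes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quarantine {
    /// Quota or account-level failure: quarantined for 24 hours.
    Hard,
    /// Google or Apps Script transient failure: short cooldown.
    Transient,
}

impl Quarantine {
    pub fn duration(self) -> Duration {
        match self {
            Quarantine::Hard => Duration::from_secs(24 * 60 * 60),
            // Assumed value, based on the ten-minute blacklist the guides
            // previously described.
            Quarantine::Transient => Duration::from_secs(10 * 60),
        }
    }
}

// Illustrative marker lists only; the real patterns are not in the PR text.
const QUOTA_MARKERS: [&str; 2] = ["quota", "service invoked too many times"];
const TRANSIENT_MARKERS: [&str; 7] = [
    "google",
    "apps script",
    "gfe",
    "backend",
    "service unavailable",
    "temporar",
    "timeout",
];

fn has_any(body: &str, markers: &[&str]) -> bool {
    markers.iter().any(|m| body.contains(*m))
}

/// Returns `None` when the failure should not quarantine the script ID,
/// for example an ordinary origin-side 5xx that was relayed correctly.
pub fn classify(status: u16, body: &str) -> Option<Quarantine> {
    let body = body.to_ascii_lowercase();
    if status == 429 || status == 403 || has_any(&body, &QUOTA_MARKERS) {
        // Quota text forces the hard path even when the outer status is 200.
        Some(Quarantine::Hard)
    } else if (500..=599).contains(&status) && has_any(&body, &TRANSIENT_MARKERS) {
        Some(Quarantine::Transient)
    } else {
        None
    }
}
```

Checking the hard class first means a 5xx response that also carries quota text gets the longer quarantine, which seems consistent with the description, though the real precedence is not stated in the source.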
Add a read-only per-deployment health snapshot over the existing relay state so operators can inspect how deployment selection is behaving without changing the scheduler itself. The snapshot reports masked script IDs, locally observed rolling quota usage, the configured local quota threshold, saturation state, active cooldown seconds, cooldown reason, and timeout strike count. Cooldown reasons are tracked alongside the existing blacklist timestamps and are pruned whenever expired blacklist entries are removed. Surface the snapshot in the desktop UI as a collapsible Script health table, clear stale rows when the proxy stops or exits, and document that these values are local client observations rather than authoritative Google-side quota counters. Add focused unit coverage for quota saturation, cooldown reason exposure, timeout strike visibility, and compact duration formatting. The relay routing, quarantine durations, and selection behavior remain unchanged.
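One possible shape for a snapshot row, sketched under assumptions. `ScriptHealth`, `snapshot_row`, `mask_script_id`, and `compact_duration` are hypothetical names; the masking format (first and last four characters) and the duration format are guesses, since the PR text lists the reported fields but not their layout.

```rust
use std::time::Duration;

/// Hypothetical read-only row, one per configured deployment.
#[derive(Debug, Clone, PartialEq)]
pub struct ScriptHealth {
    pub masked_id: String,
    pub local_calls: usize,
    pub quota_threshold: usize,
    pub saturated: bool,
    pub cooldown_secs: u64,
    pub cooldown_reason: Option<String>,
    pub timeout_strikes: u32,
}

/// Assumed masking: keep the first and last four characters, hide the rest.
pub fn mask_script_id(id: &str) -> String {
    let chars: Vec<char> = id.chars().collect();
    if chars.len() <= 8 {
        return "*".repeat(chars.len());
    }
    let head: String = chars[..4].iter().collect();
    let tail: String = chars[chars.len() - 4..].iter().collect();
    format!("{}...{}", head, tail)
}

/// Assumed compact format: "45s", "10m0s", "24h0m".
pub fn compact_duration(d: Duration) -> String {
    let s = d.as_secs();
    match s {
        0..=59 => format!("{}s", s),
        60..=3599 => format!("{}m{}s", s / 60, s % 60),
        _ => format!("{}h{}m", s / 3600, (s % 3600) / 60),
    }
}

/// Build a row from local observations only; nothing here queries Google.
pub fn snapshot_row(
    id: &str,
    local_calls: usize,
    quota_threshold: usize,
    cooldown: Option<(Duration, &str)>,
    timeout_strikes: u32,
) -> ScriptHealth {
    ScriptHealth {
        masked_id: mask_script_id(id),
        local_calls,
        quota_threshold,
        saturated: local_calls >= quota_threshold,
        cooldown_secs: cooldown.map_or(0, |(left, _)| left.as_secs()),
        cooldown_reason: cooldown.map(|(_, reason)| reason.to_string()),
        timeout_strikes,
    }
}
```

Because the row is built purely from existing local state, producing it does not touch routing or selection, in line with the read-only intent stated above.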
Author
Closing this standalone quota slice because the quota steering work has been folded into #1388, which now carries the relay batching, quota steering, failure quarantine, and script-health UI together as one coherent review unit.