fix(sandbox): SETUID/SETGID/DAC_OVERRIDE caps + env-var secrets #87
Merged
…eature
Sandbox containers were starting with cap_drop=ALL and only CHOWN/FOWNER
re-added, which meant `su temps -c ...` failed with "cannot set groups:
Operation not permitted" and the post-start chown of /home/temps hit
EACCES on readdir. Every fresh workspace sandbox aborted at startup with
SandboxCreationFailed("chown-work exit 1, write-probe exit 1").
Add the minimum caps needed for normalize_ownership to work on stock
Docker Desktop + macOS bind-mounts: DAC_OVERRIDE (so root can traverse
stale-perm dirs), SETUID/SETGID (so `su` can switch into the temps user
for the write probe and every subsequent workflow step).
Also bundled in this PR:
- Env-var secrets (write-only flag): is_secret column on env_vars,
masked on read, immutable once set. New migration + entities/services/
handlers + web UI for create/edit/list flows.
- Deployment workflow updates: sensitive_envelope service, planner/exec
refactor for the new secret-handling path.
- Notifications + monitoring web tweaks aligned with the env-var changes.
- .gitignore: exclude stray local cross-build binaries at repo root.
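The capability change described at the top of this commit can be sketched as follows. This is a minimal illustration only: the constant and function names are hypothetical, and the flags are rendered in Docker CLI form (`--cap-drop`, `--cap-add`) for readability; how the crate actually passes them to Docker is not shown in this PR text.

```rust
/// Capabilities re-added after dropping everything.
/// Hypothetical name; the real list lives in the sandbox Docker code.
const SANDBOX_CAP_ADD: [&str; 5] = [
    "CHOWN",        // already re-added before this change
    "FOWNER",       // already re-added before this change
    "DAC_OVERRIDE", // new: lets root traverse stale-perm dirs
    "SETUID",       // new: lets `su` switch into the temps user
    "SETGID",       // new: lets `su` set groups
];

/// Build the capability-related arguments in Docker CLI form.
/// Hypothetical helper, for illustration only.
fn capability_args() -> Vec<String> {
    // Drop everything first, then add back the minimum set.
    let mut args = vec!["--cap-drop=ALL".to_string()];
    for cap in SANDBOX_CAP_ADD {
        args.push(format!("--cap-add={}", cap));
    }
    args
}
```

Keeping the list in one place makes it easy to see that only three capabilities were added on top of the previous CHOWN/FOWNER pair.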
Adds full-list dimension pages reachable from each overview chart (events, referrers, browsers, OS, devices, locations, channels, languages, UTM) so users can see beyond the top 5/10. Each list fetches up to the API cap of 100 and supports client-side filtering. Event rows on the dimensions page now link to the existing event detail view, which gains "first → last" timestamp range, device, and richer per-visitor columns. Date filter (filter / from / to, including custom ranges) now propagates across every navigation hop: overview → dimensions → event detail → back. Previously the event detail page hardcoded 24h and dropped the user's selected range.
Workspace sessions get a project-scoped deployment token at sandbox-init time with a 6h expiry (see MessageExecutor::issue_session_token). Without a refresher, any session left idle past that window starts 401'ing every call from the in-sandbox CLI / credential daemon back to the control plane: the token expires but nothing re-issues it.

Add a background TokenRefresher spawned from the workspace plugin's init phase. Every 30 min it scans active workspace_sessions for matching deployment_tokens rows whose expires_at falls within the next 90 min and calls MessageExecutor::refresh_sandbox for each, which re-issues the token, rewrites ~/.env, and updates the credential daemon's env file.

Known limitation, documented on refresh_sandbox: the container's process env (TEMPS_API_TOKEN set at docker create) stays stale forever. Anything that reads `. ~/.env` per command picks up the new token transparently; anything that captured TEMPS_API_TOKEN at process start doesn't. Fixing that is a separate ticket: moving in-sandbox consumers to a per-request file read.

Tests:
- find_no_active_sessions_returns_empty_without_join: 0 active sessions → query returns empty without ever hitting the join query.
- find_returns_due_rows: 2 rows seeded, both returned with correct session_id / token_id / expires_at.
- defaults_are_sane: poll interval < refresh threshold < token lifetime.
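The timing relationship and the "due" selection can be sketched in memory as below. This is an illustration under stated assumptions, not the actual TokenRefresher: the struct, constants, and function names are hypothetical, timestamps are plain Unix seconds, and the real implementation queries the database rather than slices. Treating already-expired rows as due is also an assumption.

```rust
use std::time::Duration;

// Values taken from the commit message; the constant names are hypothetical.
const POLL_INTERVAL: Duration = Duration::from_secs(30 * 60);
const REFRESH_THRESHOLD: Duration = Duration::from_secs(90 * 60);
const TOKEN_LIFETIME: Duration = Duration::from_secs(6 * 60 * 60);

/// Simplified stand-in for a deployment_tokens row joined to its session.
#[derive(Debug)]
struct TokenRow {
    session_id: u64,
    token_id: u64,
    expires_at: u64, // Unix seconds
}

/// Poll interval < refresh threshold < token lifetime, so a token is always
/// seen by at least one poll before it expires.
fn defaults_are_sane() -> bool {
    POLL_INTERVAL < REFRESH_THRESHOLD && REFRESH_THRESHOLD < TOKEN_LIFETIME
}

/// Return the rows that should be refreshed at time `now`.
fn find_due<'a>(active_sessions: &[u64], tokens: &'a [TokenRow], now: u64) -> Vec<&'a TokenRow> {
    // With no active sessions there is nothing to join against.
    if active_sessions.is_empty() {
        return Vec::new();
    }
    let deadline = now + REFRESH_THRESHOLD.as_secs();
    tokens
        .iter()
        // Assumption: rows that have already expired are due as well.
        .filter(|t| active_sessions.contains(&t.session_id) && t.expires_at <= deadline)
        .collect()
}
```

With a 30 min poll and a 90 min threshold, each token falls inside the refresh window for roughly three consecutive polls, which leaves room for a failed refresh to be retried.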
`get_or_create_deployment_token` is called at deploy time to mint the `TEMPS_API_TOKEN` env var burned into every deployed container. That value never gets refreshed for the container's lifetime, so the token must outlive every redeploy of every app in the project. By convention the deploy-path tokens are the only rows with `expires_at IS NULL`.

The lookup filter was `expires_at IS NULL OR expires_at > now()`, which happily returned any other subsystem's short-lived token if it hadn't expired yet:
- workflow-run-* tokens have a 2h expiry
- workspace-session-* tokens have a 6h expiry

In production we saw temps-landing-new redeployed during a 2h workflow window pick up a `workflow-run-daily-traffic-report-*` token, bake it into TEMPS_API_TOKEN, then 401 every AI Gateway call once the window closed. The container's env never gets rewritten, so the 401s are permanent until the next redeploy.

Fix: filter to `expires_at IS NULL` only. Short-lived rows are skipped entirely; the function falls through to the create branch and mints a fresh permanent token, matching the long-standing assumption that "`Auto-generated for environment N` tokens never expire".

Also fix the silent error swallow at `temps-auth/src/middleware.rs`: both `tk_` and `dt_` token validation failures were being discarded with `if let Ok(...)`, which is exactly why the prod 401 was so hard to root-cause. Replace with explicit match arms that log the failure reason at WARN, including the (non-sensitive) 8-char token prefix. Operators can now grep `/tmp/temps-serve.log` for "Deployment token auth failed" and immediately see whether the cause was TokenExpired, TokenInactive, InvalidToken, or a database error.

Tests:
- test_get_or_create_skips_short_lived_tokens: seeds a non-expired workflow-run-style token, calls get_or_create, asserts the returned plaintext is NOT the seeded one AND that a permanent (expires_at IS NULL) row was created.
- test_get_or_create_reuses_permanent_token: two consecutive calls return the same plaintext, proving redeploys keep the same token.
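The difference between the old and new lookup filters can be shown with an in-memory sketch. Everything here is an assumption made for illustration: the struct, the two filter functions, and this `get_or_create` are hypothetical simplifications of the real database-backed function, and `expires_at` is modeled as optional Unix seconds.

```rust
/// Simplified stand-in for a deployment_tokens row.
#[derive(Debug, Clone, PartialEq)]
struct DeploymentToken {
    name: String,
    expires_at: Option<u64>, // None models `expires_at IS NULL`
}

/// Old filter: `expires_at IS NULL OR expires_at > now()`.
/// Matches short-lived tokens that have not expired yet.
fn old_filter(token: &DeploymentToken, now: u64) -> bool {
    token.expires_at.map_or(true, |exp| exp > now)
}

/// New filter: `expires_at IS NULL` only.
fn new_filter(token: &DeploymentToken) -> bool {
    token.expires_at.is_none()
}

/// Reuse a permanent token if one exists, otherwise create one.
fn get_or_create(tokens: &mut Vec<DeploymentToken>, environment_id: u64) -> DeploymentToken {
    if let Some(existing) = tokens.iter().find(|t| new_filter(t)) {
        return existing.clone();
    }
    // Fall through to the create branch: mint a token that never expires.
    let fresh = DeploymentToken {
        name: format!("Auto-generated for environment {}", environment_id),
        expires_at: None,
    };
    tokens.push(fresh.clone());
    fresh
}
```

Seeding the list with a single unexpired workflow-run token reproduces the production case: the old filter matches it, while the new one skips it and a permanent row is created instead.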
…, cleanup env settings

Bundled UI work spanning three areas.

Analytics: add per-dimension "View all" pages (events, referrers, browsers, OS, devices, locations, channels, languages, UTM) wired from each overview chart, plus a segment-visitors drill-down route and supporting backend on the property-breakdown endpoint.

Runtime logs: reduce inner log pane height so the Live/History tabs stay visible while scrolling.

Environment settings: drop redundant Domains card, read-only env variables card, and CurrentDeployment row (already surfaced in the header bar); leaves Configuration + Asset Cache + Danger Zone.

- New: web/src/components/analytics/DimensionList.tsx, SegmentVisitors.tsx
- New backend: crates/temps-analytics property-breakdown extensions (analytics.rs, handler.rs, traits.rs, requests.rs)
- View-all buttons added to all overview charts; events rows in the dimension list link to /analytics/events/:name preserving date range
- LogViewer pane height: calc(100vh-280px) -> calc(100vh-360px)
- EnvironmentDetail.tsx: -260 lines (Domains, Env Variables, CurrentDeployment all removed; dead imports cleaned up)
Summary
- Sandbox caps: fresh workspace sandboxes aborted with `SandboxCreationFailed("chown-work exit 1, write-probe exit 1")` because `cap_drop=ALL` was paired with only `CHOWN` + `FOWNER` re-added. `su temps -c ...` failed with "cannot set groups: Operation not permitted" and root's `chown -R /home/temps` hit `EACCES` on `readdir()`. This adds the minimum extra caps required for `normalize_ownership` to succeed on stock Docker Desktop / macOS bind-mounts: `DAC_OVERRIDE` (traverse stale-perm dirs), `SETUID` + `SETGID` (so `su` can drop into the sandbox user).
- Env-var secrets: `is_secret` write-only flag on `env_vars`. Plaintext is masked on read, the flag is one-way (a secret can't be downgraded), and updates with `value: None` keep the existing ciphertext. Includes migration, entities, service, handler, and web UI for the create/edit/list flows.
- Deployment workflow: `sensitive_envelope` service + planner/exec refactor for the new secret-handling path.
- `.gitignore`: exclude stray local cross-build binaries at repo root (`/temps-new`, `/temps-old`) to avoid accidentally committing 250+ MB ELF artifacts.

Why this matters
Workspace sandbox provisioning has been broken on macOS Docker Desktop for everyone hitting the "refresh sandbox" button on a fresh session (the first-failure path). The cap list at `crates/temps-agents/src/sandbox/docker.rs:1380` was the root cause; the chown-on-host hack from PR #84 only papered over the symptom for the bind-mount, not the `su` failure.

The env-var-secrets work has been sitting in the working tree for a while and is bundled here so it doesn't keep diverging.
Test plan
- … `chown-home` / `chown-work` / `write-probe` should all log exit 0 (or chown-work=EPERM but probe=0 on virtiofs, which the code already handles).
- … `is_secret: true`. Confirm subsequent GETs mask the value and that `is_secret: false` is rejected.
- … `sensitive_envelope`.
- `cargo check --lib --workspace` clean (confirmed locally).
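The `is_secret` semantics described in the Summary (masked on read, one-way flag, `value: None` keeps the stored value) can be sketched as below. This is a rough in-memory model under stated assumptions: all type and function names and the mask string are hypothetical, and the stored value is kept as a plain string here, whereas the PR stores ciphertext.

```rust
/// Simplified stand-in for an env_vars row (no encryption in this sketch).
#[derive(Debug, Clone, PartialEq)]
struct EnvVar {
    key: String,
    value: String,
    is_secret: bool,
}

#[derive(Debug, PartialEq)]
enum UpdateError {
    /// Attempt to turn a secret back into a plain variable.
    SecretDowngrade,
}

/// Placeholder returned instead of a secret value (hypothetical mask string).
const MASK: &str = "********";

/// What a read returns: the mask for secrets, the value otherwise.
fn read_value(var: &EnvVar) -> &str {
    if var.is_secret {
        MASK
    } else {
        &var.value
    }
}

/// Apply an update. `value: None` keeps the stored value;
/// `is_secret: Some(false)` on an existing secret is rejected.
fn update(var: &mut EnvVar, value: Option<String>, is_secret: Option<bool>) -> Result<(), UpdateError> {
    if var.is_secret && is_secret == Some(false) {
        return Err(UpdateError::SecretDowngrade);
    }
    if let Some(new_value) = value {
        var.value = new_value;
    }
    if is_secret == Some(true) {
        var.is_secret = true;
    }
    Ok(())
}
```

Rejecting the downgrade before touching any field means a failed update leaves the row exactly as it was.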