
fix(sandbox): SETUID/SETGID/DAC_OVERRIDE caps + env-var secrets#87

Merged: dviejokfs merged 6 commits into main from fix/sandbox-caps-and-env-var-secrets on May 12, 2026

@dviejokfs dviejokfs commented May 11, 2026

Summary

  • Sandbox cap fix (the primary motivation): workspace sandboxes were aborting at startup with SandboxCreationFailed("chown-work exit 1, write-probe exit 1") because cap_drop=ALL was paired with only CHOWN+FOWNER re-added. su temps -c ... failed with "cannot set groups: Operation not permitted" and root's chown -R /home/temps hit EACCES on readdir(). This adds the minimum extra caps required for normalize_ownership to succeed on stock Docker Desktop / macOS bind-mounts: DAC_OVERRIDE (traverse stale-perm dirs), SETUID + SETGID (so su can drop into the sandbox user).
  • Env-var secrets: new is_secret write-only flag on env_vars. Plaintext is masked on read, the flag is one-way (a secret can't be downgraded), and updates with value: None keep the existing ciphertext. Includes migration, entities, service, handler, and web UI for the create/edit/list flows.
  • Deployment workflow: sensitive_envelope service + planner/exec refactor for the new secret-handling path.
  • Notifications + monitoring web tweaks aligned with the env-var changes.
  • .gitignore: exclude stray local cross-build binaries at repo root (/temps-new, /temps-old) to avoid accidentally committing 250+ MB ELF artifacts.
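
The secret-flag rules in the env-var bullet above (masked on read, one-way flag, `value: None` keeps the stored value) can be sketched as follows. This is a minimal in-memory illustration, not the service code from this PR: `EnvVar`, `EnvVarUpdate`, `UpdateError`, `MASK`, `read_value`, and `apply_update` are hypothetical names, and `stored_value` stands in for the real ciphertext without any encryption.

```rust
/// Hypothetical in-memory model of an env var row. Only `is_secret` is a
/// name taken from the PR; the other fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct EnvVar {
    pub key: String,
    /// Stand-in for the stored ciphertext (no real encryption here).
    pub stored_value: String,
    pub is_secret: bool,
}

/// Hypothetical update payload: `None` means "leave this field unchanged".
#[derive(Debug, Default)]
pub struct EnvVarUpdate {
    pub value: Option<String>,
    pub is_secret: Option<bool>,
}

#[derive(Debug, PartialEq)]
pub enum UpdateError {
    /// A secret cannot be downgraded to a regular variable.
    SecretDowngrade,
}

/// Placeholder returned instead of a secret's value (assumed mask string).
pub const MASK: &str = "********";

/// What a read returns: secrets are masked, regular values pass through.
pub fn read_value(var: &EnvVar) -> &str {
    if var.is_secret {
        MASK
    } else {
        &var.stored_value
    }
}

/// Applies an update while enforcing the one-way `is_secret` flag.
pub fn apply_update(var: &mut EnvVar, update: EnvVarUpdate) -> Result<(), UpdateError> {
    // Reject the downgrade before mutating anything.
    if var.is_secret && update.is_secret == Some(false) {
        return Err(UpdateError::SecretDowngrade);
    }
    if let Some(flag) = update.is_secret {
        var.is_secret = flag;
    }
    // `value: None` keeps the existing stored value untouched.
    if let Some(value) = update.value {
        var.stored_value = value;
    }
    Ok(())
}
```

Checking the downgrade before any mutation means a rejected update leaves the row exactly as it was.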

Why this matters

Workspace sandbox provisioning has been broken on macOS Docker Desktop for anyone clicking "refresh sandbox" on a fresh session, so the very first attempt fails. The cap list at crates/temps-agents/src/sandbox/docker.rs:1380 was the root cause; the chown-on-host hack from PR #84 only papered over the bind-mount symptom and did not address the su failure.

The env-var-secrets work has been sitting in the working tree for a while and is bundled here so it doesn't keep diverging.

Test plan

  • On macOS + Docker Desktop, click "refresh sandbox" on a brand-new workspace session. Sandbox should provision; chown-home / chown-work / write-probe should all log exit 0 (or chown-work=EPERM but probe=0 on virtiofs, which the code already handles).
  • Create a regular env var, then PATCH it with is_secret: true. Confirm subsequent GETs mask the value and that is_secret: false is rejected.
  • Verify deployment with a secret env var still injects the plaintext into the runtime container via sensitive_envelope.
  • Notifications create/edit pages still work end-to-end.
  • cargo check --lib --workspace clean (confirmed locally).

dviejokfs added 6 commits May 11, 2026 16:00
…eature

Sandbox containers were starting with cap_drop=ALL and only CHOWN/FOWNER
re-added, which meant `su temps -c ...` failed with "cannot set groups:
Operation not permitted" and the post-start chown of /home/temps hit
EACCES on readdir. Every fresh workspace sandbox aborted at startup with
SandboxCreationFailed("chown-work exit 1, write-probe exit 1").

Add the minimum caps needed for normalize_ownership to work on stock
Docker Desktop + macOS bind-mounts: DAC_OVERRIDE (so root can traverse
stale-perm dirs), SETUID/SETGID (so `su` can switch into the temps user
for the write probe and every subsequent workflow step).
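
As a rough illustration of the capability set described above, the sketch below lists the caps re-added on top of cap_drop=ALL and reports which ones a given configuration lacks. `SANDBOX_CAP_ADD` and `missing_caps` are hypothetical names; the actual list lives in docker.rs and may be expressed differently.

```rust
/// Capabilities re-added on top of `cap_drop=ALL`, as listed in this change.
/// The constant name is hypothetical.
pub const SANDBOX_CAP_ADD: [&str; 5] = [
    "CHOWN",        // re-added before this change
    "FOWNER",       // re-added before this change
    "DAC_OVERRIDE", // new: lets root traverse stale-perm dirs
    "SETUID",       // new: lets `su` switch user
    "SETGID",       // new: lets `su` set groups
];

/// Returns the required capabilities that are absent from `configured`,
/// preserving the order of `required`.
pub fn missing_caps<'a>(configured: &[&str], required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|cap| !configured.iter().any(|c| *c == *cap))
        .collect()
}
```

Run against the old CHOWN/FOWNER-only list, `missing_caps` returns exactly the three caps this commit adds.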

Also bundled in this PR:
- Env-var secrets (write-only flag): is_secret column on env_vars,
  masked on read, immutable once set. New migration + entities/services/
  handlers + web UI for create/edit/list flows.
- Deployment workflow updates: sensitive_envelope service, planner/exec
  refactor for the new secret-handling path.
- Notifications + monitoring web tweaks aligned with the env-var changes.
- .gitignore: exclude stray local cross-build binaries at repo root.

Adds full-list dimension pages reachable from each overview chart
(events, referrers, browsers, OS, devices, locations, channels,
languages, UTM) so users can see beyond the top 5/10. Each list
fetches up to the API cap of 100 and supports client-side filtering.

Event rows on the dimensions page now link to the existing event
detail view, which gains "first → last" timestamp range, device,
and richer per-visitor columns.

Date filter (filter / from / to, including custom ranges) now
propagates across every navigation hop: overview → dimensions →
event detail → back. Previously the event detail page hardcoded
24h and dropped the user's selected range.

Workspace sessions get a project-scoped deployment token at sandbox-init
time with a 6h expiry (see MessageExecutor::issue_session_token). Without
a refresher, any session left idle past that window starts 401'ing every
call from the in-sandbox CLI / credential daemon back to the control
plane — the token expires but nothing re-issues it.

Add a background TokenRefresher spawned from the workspace plugin's init
phase. Every 30 min it scans active workspace_sessions for matching
deployment_tokens rows whose expires_at falls within the next 90 min and
calls MessageExecutor::refresh_sandbox for each — which re-issues the
token, rewrites ~/.env, and updates the credential daemon's env file.
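
The selection step of the refresher can be sketched in memory as below. This is an assumption-laden illustration, not the query from this commit: `SessionToken`, `find_due`, and the three constants are hypothetical names, timestamps are plain seconds, and rows that have already expired are treated as due (the commit message does not say how those are handled).

```rust
use std::time::Duration;

// Timings taken from the description above; the constant names are hypothetical.
pub const POLL_INTERVAL: Duration = Duration::from_secs(30 * 60);
pub const REFRESH_THRESHOLD: Duration = Duration::from_secs(90 * 60);
pub const TOKEN_LIFETIME: Duration = Duration::from_secs(6 * 60 * 60);

/// Hypothetical joined row: a workspace session plus its deployment token.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionToken {
    pub session_id: i32,
    pub token_id: i32,
    /// Expiry in seconds since an arbitrary epoch.
    pub expires_at: u64,
    pub session_active: bool,
}

/// Returns tokens of active sessions that expire within the refresh
/// threshold. Already-expired rows are included (an assumption).
pub fn find_due(rows: &[SessionToken], now: u64) -> Vec<&SessionToken> {
    let deadline = now + REFRESH_THRESHOLD.as_secs();
    rows.iter()
        .filter(|row| row.session_active && row.expires_at <= deadline)
        .collect()
}
```

With a 30 min poll and a 90 min threshold, a token becomes due at least two polls before it expires, which leaves room for one failed refresh attempt.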

Known limitation, documented on refresh_sandbox: the container's process
env (TEMPS_API_TOKEN set at docker create) stays stale forever. Anything
that reads `. ~/.env` per command picks up the new token transparently;
anything that captured TEMPS_API_TOKEN at process start doesn't. Fixing
that is a separate ticket — moving in-sandbox consumers to a per-request
file read.
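
A per-request file read, as proposed for the follow-up ticket, might look like the sketch below. It assumes ~/.env holds shell-style `TEMPS_API_TOKEN=...` lines, optionally prefixed with `export` and optionally quoted; the real file format is not shown in this commit, and `token_from_env_contents` and `read_token` are hypothetical helpers.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extracts TEMPS_API_TOKEN from env-file contents. The last assignment
/// wins, mirroring what sourcing the file in a shell would do.
pub fn token_from_env_contents(contents: &str) -> Option<String> {
    let mut found = None;
    for line in contents.lines() {
        let line = line.trim();
        // Accept both `KEY=value` and `export KEY=value` (assumed formats).
        let line = line.strip_prefix("export ").unwrap_or(line);
        if let Some(rest) = line.strip_prefix("TEMPS_API_TOKEN=") {
            // Drop surrounding single or double quotes, if any.
            let value = rest.trim().trim_matches(|c: char| c == '"' || c == '\'');
            found = Some(value.to_string());
        }
    }
    found
}

/// Reads the token from disk on every call instead of caching it at
/// process start, so a refreshed token is picked up on the next request.
pub fn read_token(path: &Path) -> io::Result<Option<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(token_from_env_contents(&contents))
}
```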

Tests:
- find_no_active_sessions_returns_empty_without_join: 0 active sessions
  → query returns empty without ever hitting the join query.
- find_returns_due_rows: 2 rows seeded, both returned with correct
  session_id / token_id / expires_at.
- defaults_are_sane: poll interval < refresh threshold < token lifetime.

`get_or_create_deployment_token` is called at deploy time to mint the
`TEMPS_API_TOKEN` env var burned into every deployed container. That
value never gets refreshed for the container's lifetime, so the token
must outlive every redeploy of every app in the project — by convention
the deploy-path tokens are the only rows with `expires_at IS NULL`.

The lookup filter was `expires_at IS NULL OR expires_at > now()`, which
happily returned any other subsystem's short-lived token if it hadn't
expired yet:

  - workflow-run-* tokens have a 2h expiry
  - workspace-session-* tokens have a 6h expiry

In production we saw temps-landing-new redeployed during a 2h workflow
window pick up a `workflow-run-daily-traffic-report-*` token, bake it
into TEMPS_API_TOKEN, then 401 every AI Gateway call once the window
closed. The container's env never gets rewritten, so the 401s are
permanent until the next redeploy.

Fix: filter to `expires_at IS NULL` only. Short-lived rows are skipped
entirely; the function falls through to the create branch and mints a
fresh permanent token, matching the long-standing assumption that
"`Auto-generated for environment N` tokens never expire".

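The difference between the old and new lookup filters can be shown with a small in-memory model. This is a sketch under stated assumptions, not the database query from this commit: `TokenRow`, `old_lookup`, `new_lookup`, and `get_or_create` are hypothetical names, and `expires_at: None` models `expires_at IS NULL`.

```rust
/// Hypothetical token row; `expires_at: None` models `expires_at IS NULL`.
#[derive(Debug, Clone, PartialEq)]
pub struct TokenRow {
    pub name: String,
    /// Expiry in seconds since an arbitrary epoch, or `None` if permanent.
    pub expires_at: Option<u64>,
}

/// Old filter: `expires_at IS NULL OR expires_at > now()`.
/// Any unexpired short-lived token can match.
pub fn old_lookup(rows: &[TokenRow], now: u64) -> Option<&TokenRow> {
    rows.iter().find(|row| row.expires_at.map_or(true, |t| t > now))
}

/// New filter: `expires_at IS NULL` only.
pub fn new_lookup(rows: &[TokenRow]) -> Option<&TokenRow> {
    rows.iter().find(|row| row.expires_at.is_none())
}

/// Reuses a permanent token if one exists, otherwise creates one.
pub fn get_or_create(rows: &mut Vec<TokenRow>, environment_id: u32) -> TokenRow {
    if let Some(existing) = new_lookup(rows.as_slice()) {
        return existing.clone();
    }
    let created = TokenRow {
        name: format!("Auto-generated for environment {}", environment_id),
        expires_at: None,
    };
    rows.push(created.clone());
    created
}
```

Seeding only a 2h workflow-run token reproduces the bug with `old_lookup`, while `get_or_create` skips that row and mints a permanent one.
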
Also fix the silent error swallow at `temps-auth/src/middleware.rs` —
both `tk_` and `dt_` token validation failures were being discarded
with `if let Ok(...)`, which is exactly why the prod 401 was so hard
to root-cause. Replace with explicit match arms that log the failure
reason at WARN, including the (non-sensitive) 8-char token prefix.
Operators can now grep `/tmp/temps-serve.log` for "Deployment token
auth failed" and immediately see whether the cause was TokenExpired,
TokenInactive, InvalidToken, or a database error.
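
The shape of that middleware change, replacing a silent `if let Ok(...)` with explicit match arms, is roughly as follows. The sketch is std-only and hypothetical: a `Vec<String>` stands in for the WARN-level logger, and `TokenAuthError`, `token_prefix`, `describe_failure`, and `authenticate` are illustrative names. Only the "Deployment token auth failed" phrase and the four failure kinds come from the commit message; the rest of the message format is assumed.

```rust
/// Failure kinds named in the commit message; the enum itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenAuthError {
    TokenExpired,
    TokenInactive,
    InvalidToken,
    Database(String),
}

/// First 8 characters of the token, treated as non-sensitive.
pub fn token_prefix(token: &str) -> String {
    token.chars().take(8).collect()
}

/// Builds the log line for a failed validation (format is an assumption).
pub fn describe_failure(token: &str, err: &TokenAuthError) -> String {
    let reason = match err {
        TokenAuthError::TokenExpired => "TokenExpired".to_string(),
        TokenAuthError::TokenInactive => "TokenInactive".to_string(),
        TokenAuthError::InvalidToken => "InvalidToken".to_string(),
        TokenAuthError::Database(msg) => format!("database error: {}", msg),
    };
    format!(
        "Deployment token auth failed: reason={} prefix={}",
        reason,
        token_prefix(token)
    )
}

/// Explicit match instead of `if let Ok(...)`: the error arm records why
/// validation failed rather than discarding the reason.
pub fn authenticate(
    token: &str,
    validation: Result<u32, TokenAuthError>,
    warn_log: &mut Vec<String>,
) -> Option<u32> {
    match validation {
        Ok(project_id) => Some(project_id),
        Err(err) => {
            warn_log.push(describe_failure(token, &err));
            None
        }
    }
}
```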

Tests:
- test_get_or_create_skips_short_lived_tokens: seeds a non-expired
  workflow-run-style token, calls get_or_create, asserts the returned
  plaintext is NOT the seeded one AND that a permanent (expires_at
  IS NULL) row was created.
- test_get_or_create_reuses_permanent_token: two consecutive calls
  return the same plaintext, proving redeploys keep the same token.

…, cleanup env settings

Bundled UI work spanning three areas. Analytics: add per-dimension
"View all" pages (events, referrers, browsers, OS, devices, locations,
channels, languages, UTM) wired from each overview chart, plus a
segment-visitors drill-down route and supporting backend on the
property-breakdown endpoint. Runtime logs: reduce inner log pane
height so the Live/History tabs stay visible while scrolling.
Environment settings: drop redundant Domains card, read-only env
variables card, and CurrentDeployment row (already surfaced in the
header bar); leaves Configuration + Asset Cache + Danger Zone.

- New: web/src/components/analytics/DimensionList.tsx, SegmentVisitors.tsx
- New backend: crates/temps-analytics property-breakdown extensions
  (analytics.rs, handler.rs, traits.rs, requests.rs)
- View-all buttons added to all overview charts; events rows in the
  dimension list link to /analytics/events/:name preserving date range
- LogViewer pane height: calc(100vh-280px) -> calc(100vh-360px)
- EnvironmentDetail.tsx: -260 lines (Domains, Env Variables,
  CurrentDeployment all removed; dead imports cleaned up)
@dviejokfs dviejokfs merged commit 51c8b4d into main May 12, 2026
8 checks passed