Skip to content

feat(DATAGO-130042): synthetic-monitor auth path for Datadog smoke tests#1461

Merged
JKaram merged 11 commits into
mainfrom
JKaram/DATAGO-130042/synthetic-monitor-auth
May 11, 2026
Merged

feat(DATAGO-130042): synthetic-monitor auth path for Datadog smoke tests#1461
JKaram merged 11 commits into
mainfrom
JKaram/DATAGO-130042/synthetic-monitor-auth

Conversation

@JKaram
Copy link
Copy Markdown
Contributor

@JKaram JKaram commented Apr 27, 2026

What is the purpose of this change?

Adds a config-gated, non-interactive auth path so Datadog Synthetics (and any future machine client) can authenticate against the gateway without driving the SSO UI. Enables an upcoming end-to-end smoke test that sends a chat prompt and asserts a non-empty LLM response.

solace-chat today only supports OIDC Authorization Code flow against Entra ID — every authenticated caller must go through the user login UI. That doesn't work for synthetic monitoring.

How was this change implemented?

New module src/solace_agent_mesh/shared/auth/synthetic.py:

  • Loads Datadog client app GUIDs, role name, audience, tenant ID, and an endpoint allowlist from component config. Fail-closed (any missing required field, empty appid allowlist, or empty endpoint allowlist disables the path entirely).
  • Validates Entra-issued client-credentials JWTs locally via PyJWKClient (existing dependency).
  • Distinguishes "not a synthetic token" (SyntheticTokenNotApplicable, caller falls through to other auth paths) from "synthetic token but invalid" (SyntheticTokenInvalid, caller hard-rejects). This boundary prevents an attacker from probing the validator by varying claims.

Modified src/solace_agent_mesh/shared/auth/middleware.py:

  • Loads synthetic config once at startup; logs a single info line when enabled.
  • New _try_synthetic_auth runs ahead of the existing sam_access_token and IdP branches.
  • Maps a valid synthetic token to a fixed principal (user_id = "synthetic-monitor", non-routable email, is_synthetic flag) so downstream code that expects a current_user keeps working unchanged.
  • Endpoint allowlist (method + path regex) enforced at the auth layer before any handler runs. Default deny.

Key Design Decisions

  • Local JWKS validation rather than delegating to the external auth service. Synthetic auth is security-sensitive (strict role match, appid allowlist, mixed-claim rejection) and co-locating validation with principal construction and endpoint enforcement makes it easier to audit.
  • Hard reject (don't fall through) on synthetic-shaped invalid tokens. A token claiming to be synthetic that fails validation is suspicious — falling through to other auth paths would let an attacker probe by varying claims.
  • Strict role equality (roles == [config.role_name]) rather than contains. Prevents privilege creep if a future role is granted alongside.
  • Non-routable synthetic email (synthetic-monitor@synthetics.invalid) — can't collide with a real user, can't be probed via account enumeration.
  • Fail-closed config loader — refuses to enable with empty allowlists. An empty appid allowlist would otherwise accept any tenant app with the role; an empty endpoint allowlist would deny everything.
  • is_synthetic schema flag and conversation cleanup are deferred as stretch goals. Synthetic conversations are identified by the well-known user_id = "synthetic-monitor"; cleanup can be added later without touching auth.

How was this change tested?

  • Unit tests: tests/unit/shared/auth/test_synthetic.py (29 tests, all passing in 0.44s)
    • Discriminator boundary (NotApplicable vs Invalid)
    • Every rejection path: alg=none, wrong audience/issuer/tenant, appid not allowlisted, extra roles, missing role, expired token, each user-identity claim parameterized
    • Happy path returns expected claims
    • Config fail-closed cases (disabled-by-default, missing field, empty allowlists)
    • Endpoint allowlist incl. method mismatch and path-extension attempts
  • Tests use a real RSA keypair generated per session — PyJWT signature/claim validation is actually exercised, not mocked
  • Integration tests against a running gateway: not yet — gated on Phase 2 (Entra setup)
  • Known limitation: rate limiting for the synthetic principal isn't implemented in this PR (lives elsewhere in the stack); needs a separate decision

Is there anything the reviewers should focus on/be aware of?

  • Security review of synthetic.py: each defense-in-depth check (signature, alg allowlist, iss, aud, tid, strict roles, appid allowlist, no-user-claims, fail-closed config) is independently load-bearing. Removing any one check creates a real vulnerability.
  • Ordering in middleware: the synthetic branch runs first when enabled; it must remain that way so synthetic tokens don't accidentally take the IdP fallback path.
  • Disabled by default. No behaviour change for existing deployments. Each environment (staging, prod) opts in via config:
    synthetic_auth_enabled: true
    synthetic_auth_tenant_id: <entra-tenant-guid>
    synthetic_auth_audience: api://<solace-chat-app-id-uri>
    synthetic_auth_role_name: Synthetics.Smoke
    synthetic_auth_appid_allowlist: [<datadog-sp-appid-guid>]
    synthetic_auth_endpoint_allowlist:
      - { method: GET, path: "^/api/v1/sessions$" }
      - { method: POST, path: "^/api/v1/messages$" }
      # plus whatever the chat smoke actually needs
  • Endpoint allowlist contents are placeholders — the real list depends on what the chat send-message + response-stream endpoints are exactly. Will be tightened during Phase 2 of the rollout once the smoke test is wired up.
  • The middleware lives in the shared solace-agent-mesh package, so the synthetic path is technically available to any deployment that opts in via config — not just solace-chat. Default-off behaviour means this is safe, but worth flagging.

Adds a config-gated, non-interactive auth path so Datadog Synthetics
(and any future machine client) can authenticate against the gateway
without driving the SSO UI. Used by an upcoming end-to-end smoke test
that sends a chat prompt and asserts a non-empty LLM response.

How it works:
- Caller mints an Entra ID JWT via OAuth client_credentials with a
  dedicated app role (no user identity in the token).
- New synthetic.py validates that JWT locally against the tenant JWKS
  and maps it to a fixed synthetic principal (user_id = "synthetic-
  monitor", non-routable email, is_synthetic flag) so downstream code
  that expects a current_user keeps working.
- Middleware runs the synthetic branch ahead of sam_access_token and
  IdP paths. Synthetic-shaped tokens that fail validation are hard-
  rejected (401) rather than falling through, so an attacker can't
  probe the validator by varying claims.

Security controls (each independently load-bearing):
- RS256-only; alg=none and HS variants refused by PyJWT.
- Standard JWT validation: signature, iss, aud, tid, exp, nbf, iat
  with 60s skew.
- Strict roles == [configured-role] (no privilege creep via extras).
- appid/azp must be in a configured allowlist of Datadog SP GUIDs.
- Any user-identity claim (sub/oid/preferred_username/upn/unique_name)
  hard-rejects the token.
- Endpoint allowlist (method + path regex) enforced at the auth layer
  before any handler runs. Default deny.
- Config fail-closed: missing required field, empty appid allowlist,
  or empty endpoint allowlist all disable the path entirely.

Disabled by default. Enabled via component config:
  synthetic_auth_enabled
  synthetic_auth_tenant_id
  synthetic_auth_audience
  synthetic_auth_role_name
  synthetic_auth_appid_allowlist
  synthetic_auth_endpoint_allowlist

Tests cover discriminator boundary (NotApplicable vs Invalid), every
rejection path, happy path, config fail-closed cases, and endpoint
allowlist behaviour including method mismatch and path-extension
attempts.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 27, 2026

✅ FOSSA Guard: Licensing (SolaceLabs_solace-agent-mesh) • PASSED

Compared against main (9b6a9c31cb6e44135812ad01cc06c612c0c8bd23) • 0 new, 9 total (9 in base)

Scan Report | View Details in FOSSA

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 27, 2026

✅ FOSSA Guard: Vulnerability (SolaceLabs_solace-agent-mesh) • PASSED

Compared against main (9b6a9c31cb6e44135812ad01cc06c612c0c8bd23) • 0 new, 14 total (14 in base)

Scan Report | View Details in FOSSA

@JKaram JKaram requested review from enavitan and mo-radwan1 April 28, 2026 18:44
Comment on lines +119 to +120
issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
jwks_uri = f"https://login.microsoftonline.com/{tenant_id}/discovery/v2.0/keys"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could allow other OIDC providers too, similar to how it is done here?
https://github.com/SolaceDev/solace-agent-mesh-enterprise/blob/main/examples/oauth2_config.yaml#L44

Copy link
Copy Markdown
Contributor Author

@JKaram JKaram Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point - But, Going generic across OIDC providers (Okta/Auth0/etc.) is a bigger change than the URL config though — claim names and semantics differ across IdPs. The oauth2_config.yaml analogy works for user-OIDC because sub/email/name are pretty standard, but client-credentials tokens are much less standardized. If a second IdP becomes a real requirement we can add a per-provider abstraction at that point — likely cleaner than parameterizing what's currently here.

For now this is solely for our internal Datadog Synthetic test against solace-chat, so I'd rather not preemptively build the generic shape.

JKaram added 10 commits May 5, 2026 11:14
The first real client_credentials token from Entra revealed two
over-strict checks that would reject every legitimately-issued token:

- Entra defaults to v1 issuer (sts.windows.net) for client_credentials
  unless the resource opts into v2 via accessTokenAcceptedVersion=2.
  Accept both v1 and v2 issuers for the configured tenant.
- Entra includes sub and oid in every app-only token to identify the
  service principal; they are not user identity. Drop them from the
  user-claim rejection list. preferred_username, upn, and unique_name
  remain — those only appear in user-context tokens.

Add appidacr ("Application Authentication Context Reference") as a
compensating control: must be "1" (client secret) or "2" (certificate).
Rejects "0" (public client, no auth) and missing-claim cases.

Tests: 29 → 31. New cases cover v1-issuer acceptance, sub/oid
acceptance, appidacr="0" rejection, and missing-appidacr rejection.
…peek

SonarQube's python:S5659 rightly flags any unverified jwt.decode call.
This particular call is a routing-only discriminator: it decides whether
to route the token through the synthetic auth branch vs fall through to
IdP/sam_access_token, before we know if Entra-JWKS signature verification
will succeed.

The peek is never used for any auth decision — all auth decisions
(appid allowlist, role strict match, claim-absence checks, audience,
issuer, tenant, expiry) use the verified claims obtained from a second
jwt.decode call below that does verify the signature against the JWKS.
A forged token that passes the routing peek is hard-rejected at the
signature verification step.

Strengthen the inline comment to make the intent and safety bound
unambiguous, and add NOSONAR(python:S5659) so reviewers see the
suppression is deliberate.
Address review feedback (#1461 r3163509636). The issuer/JWKS URLs are
no longer hard-tied to the Entra public cloud — two new optional
config keys override the auto-derived defaults:

- synthetic_auth_issuers: list of acceptable issuer strings
- synthetic_auth_jwks_uri:  JWKS URL for signature verification

Defaults still derive from synthetic_auth_tenant_id and target the
public Entra cloud (sts.windows.net + login.microsoftonline.com), so
existing setups need no changes.

Useful for:
- Entra Government (login.microsoftonline.us)
- Entra China (login.partner.microsoftonline.cn)
- Any future Entra-claim-shaped IdP at a non-default URL

Validation logic remains Entra-claim-shaped (appid, appidacr, roles),
so this is not a generalization to arbitrary OIDC providers — that
would require configurable claim names and per-provider validation,
a meaningfully larger change. Documented this scope in the PR thread.

Tests +2: default-issuers-target-entra-public-cloud,
issuer-and-jwks-overrides-apply.
Make synthetic-auth config-loading more diagnosable when staging deploys
mis-set env vars:
- Bumped startup ENABLED log to include issuer count and a clearer name.
- Added a startup WARNING when the path is NOT enabled, pointing at the
  config keys to check. Previously a misconfigured synthetic_auth_enabled
  silently dropped traffic to the IdP path with no signal.
- Hardened SyntheticAuthConfig parsing: env-var-substituted YAML can
  deliver list/dict-shaped fields as JSON strings; _coerce_to_list now
  parses those before validation so configs like
  synthetic_auth_appid_allowlist: '${...}' work without a manual !!seq.
The synthetic principal authenticated cleanly but every downstream
authorization check 403'd because AuthorizationService.get_scopes_for_user
fell back to MS Graph for role lookup. MS Graph 404s on the
synthetic-monitor identity since it isn't a real Entra user.

Per AuthorizationService.get_scopes_for_user (in solace-agent-mesh-
enterprise), if `roles` is non-None on user_state, it uses those
directly and skips role-provider lookup entirely. Same mechanism
sam_access_token already uses to bypass MS Graph.

Adds a new required config key `synthetic_auth_roles` (list of role
names defined in role-to-scope-definitions.yaml). build_synthetic_user_state
now sets `roles` on the returned state. Fail-closed: empty roles list
keeps the path disabled to avoid silently re-introducing the MS Graph
fallback.

Tests: 33 → 35. New: empty-roles disables, user state carries roles.
…d claims

_extract_initial_claims was building a fresh dict from request.state.user
that included only id/name/email/user_info — dropping any 'roles' key
that auth middleware had pre-populated. Downstream code
(submit_a2a_task → resolve_user_config → _get_user_state_roles)
checks for roles at the top level of user_identity, not inside
user_info, so the synthetic principal's pre-resolved roles never
reached AuthorizationService.get_scopes_for_user.

Effect was that even with synthetic_auth_roles=["SyntheticMonitor"]
correctly set on user_state, the downstream scope check fell back to
MS Graph role lookup, 404'd on the synthetic identity, and denied
agent access.

Forward roles to top-level claims in both return paths so any auth
method that pre-populates roles (sam_access_token already does this,
synthetic now does too) propagates them to the agent-access check.
Make the trust boundary on the role-forwarding path explicit. Today
user_state["roles"] is only populated by two server-trusted code paths
(synthetic auth pulls from YAML config; sam_access_token pulls from a
signed internal token verified by trust_manager), so the unconditional
forward in _extract_initial_claims is safe in practice. But the check
was an "if 'roles' is present" pattern, which would silently start
forwarding user-controlled JWT roles the moment any future auth path
populated state.user["roles"] from request claims.

Tighten by allowlisting auth_method ∈ {synthetic, sam_access_token}.
Other paths fall through to the existing role-provider lookup. No
functional change for current callers.
@sonarqube-solacecloud
Copy link
Copy Markdown

@JKaram JKaram requested a review from macstewart May 7, 2026 13:22
@JKaram JKaram changed the title feat(DATAGO-130042): synthetic-monitor auth path for Datadog smoke feat(DATAGO-130042): synthetic-monitor auth path for Datadog smoke tests May 7, 2026
@JKaram JKaram merged commit 9a76c0f into main May 11, 2026
29 of 31 checks passed
@JKaram JKaram deleted the JKaram/DATAGO-130042/synthetic-monitor-auth branch May 11, 2026 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants