Skip to content

feat(service-map): server-side filtering, latency percentiles, throughput & focus#2387

Open
MikeShi42 wants to merge 5 commits into
mainfrom
claude/service-map-integrated
Open

feat(service-map): server-side filtering, latency percentiles, throughput & focus#2387
MikeShi42 wants to merge 5 commits into
mainfrom
claude/service-map-integrated

Conversation

@MikeShi42
Copy link
Copy Markdown
Contributor

What & why

The Service Map showed topology plus only part of RED (request counts + error
rate). This brings it closer to a full APM-style service map: filter large maps
down to what you care about, see latency (the missing D in RED) and
throughput, and pivot/inspect a service.

Commit 1 — Server-side filtering

  • A Lucene/SQL where input and a service-name MultiSelect scope which
    spans/services the map is built from, with inbound/outbound neighbor
    expansion so a focused service still shows its immediate dependencies.

Commit 2 — Latency percentiles, throughput, focus & node sizing

  • p50 / p95 / p99 latency per node and per edge, computed server-side via a
    single GROUPING SETS query that emits both rolled-up node-level rows and
    per-edge rows (percentiles can't be combined client-side). Errors now come
    from countIf, dropping the old per-status map.
  • Throughput (req/s), labeled, shown alongside percentiles in tooltips.
  • A Focus action in the node tooltip drives the service filter to that
    service and its immediate neighbors (clicking again clears it). Clicking a
    node just selects it / opens the tooltip — it does not auto-focus.
  • Node size scales with total throughput (incoming + outgoing requests).

Testing

  • 68 unit tests covering the aggregation and the pure helpers
    (rawDurationToMs, getRequestsPerSecond, formatRate, getNodeSize).
  • tsc, eslint, and stylelint are clean.
  • The GROUPING SETS query was executed against a live ClickHouse to confirm it
    parses/runs (GROUPING(), countIf, quantile).
  • Not yet visually validated against real trace data.

Notes for reviewers

  • Edge latency is the server-side processing time of calls from that caller
    (server spans grouped by caller), not client-observed round-trip latency.
  • Percentiles are approximate (computed over sampled spans with approximate
    quantiles) and are marked with a ~ in the UI.
  • DBServiceMapPage.tsx is now ~390 lines; extracting the filter bar into its
    own component is a reasonable follow-up.

🤖 Generated with Claude Code

MikeShi42 added 2 commits May 30, 2026 18:30
…de sizing

Builds on the server-side filtering to make the map a real RED-metrics view:

- Latency p50/p95/p99 computed server-side via a single GROUPING SETS query
  that emits both rolled-up node-level rows and per-edge rows (percentiles
  can't be combined client-side). Errors via countIf; drops the per-status map.
- Labeled request throughput (req/s) shown alongside percentiles in node and
  edge tooltips.
- 'Focus' action in the node tooltip drives the server-side service filter to
  that service and its immediate neighbors (toggles off when reselected);
  clicking a node selects it and reveals the tooltip rather than auto-focusing.
- Node size scales with throughput (sqrt of incoming volume vs the busiest
  node).
- Pure unit-tested helpers: rawDurationToMs, getRequestsPerSecond, formatRate,
  getNodeSize; aggregation tests rewritten for the new row model.
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 30, 2026

🦋 Changeset detected

Latest commit: bd20dba

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/app Patch
@hyperdx/api Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 30, 2026 7:58pm
hyperdx-storybook Ready Ready Preview, Comment May 30, 2026 7:58pm

Request Review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 30, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 790 production lines changed (threshold: 400)

Additional context: agent branch (claude/service-map-integrated)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 7
  • Production lines changed: 790 (+ 758 in test files, excluded from tier calculation)
  • Branch: claude/service-map-integrated
  • Author: MikeShi42

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 30, 2026

E2E Test Results

All tests passed • 192 passed • 3 skipped • 1321s

Status Count
✅ Passed 192
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 30, 2026

Deep Review

✅ No critical issues found.

🟡 P2 — recommended

  • packages/app/src/hooks/useServiceMap.tsx:237hasLatency: row.p50 != null lets NaN through because Number('nan') is NaN and NaN != null, so when ClickHouse quantiles() returns 'nan' for an empty group the tooltip will render p50 ~NaNh · p95 ~NaNh · p99 ~NaNh.
    • Fix: Require Number.isFinite(row.p50) for hasLatency, and convert non-finite array entries to undefined when reading quantiles[i].
    • correctness, reliability, adversarial, kieran-typescript
  • packages/app/src/components/ServiceMap/utils.ts:155source.durationPrecision ?? 3 disagrees with the codebase default of 9 used in source.ts and serviceDashboard.ts; if a source ever reaches this path without durationPrecision, raw nanoseconds are treated as milliseconds and every latency renders 1,000,000× too large.
    • Fix: Change the fallback to ?? 9 to match the established default, or fail loudly if the field is genuinely required upstream.
    • maintainability
  • packages/app/src/hooks/useServiceMap.tsx:144 — With server-side service filtering, the WHERE (ServerSpans.serviceName IN (...) OR ClientSpans.serviceName IN (...)) runs before GROUPING SETS, so a neighbor node's rolled-up tooltip shows totals/percentiles scoped to the focused service only — labeled identically to its full-graph stats.
    • Fix: Disambiguate the tooltip when a service filter is active (label the rollup as scoped) or suppress the per-neighbor node tooltip and only show the per-edge view.
    • correctness, adversarial
  • packages/app/src/DBServiceMapPage.tsx:78parseAsArrayOf(parseAsString) joins on , by default, so service names containing a comma round-trip as two separate entries; Focus on such a service writes a URL that reloads as a multi-service filter for non-existent names, and the map shows empty.
    • Fix: Configure parseAsArrayOf with a separator illegal in service names (e.g. |), or encode each element.
    • adversarial
  • packages/app/src/components/ServiceMap/ServiceMap.tsx:125 — The layout useEffect depends on services (a fresh Map per react-query success), onFocusService, and other identity-unstable values, so dagre re-runs and the companion fitView effect re-fits the camera every time those identities flip — causing visible camera jumps on filter change even when topology is unchanged.
    • Fix: Memoize the rebuilt nodes/edges by a content hash of the services map and skip the layout call when nothing changed; read onFocusService via a ref to keep it out of the dep array.
    • performance, adversarial
  • packages/app/src/hooks/useServiceMap.tsx:147 — The serviceName filter is applied in the outer WHERE, after the CTEs have already scanned the full time-range; pushing the IN-list into baseCTEConfig.filters would let ClickHouse prune partitions and skip rows at scan time.
    • Fix: Add the serverServiceName IN (...) predicate inside baseCTEConfig.filters (keeping the OR-based neighbor expansion in the outer WHERE).
    • performance
  • packages/app/src/DBServiceMapPage.tsx:178 — The new onFocusService toggle (three branches: focus unfocused, re-click focused → clear, click while multi-selected → collapse) has no unit or component test, and the toggle silently destroys a prior multi-select selection with no way to recover it.
    • Fix: Extract the toggle into a pure helper and unit-test the three branches; consider preserving prior selection or disabling Focus when more than one service is selected.
    • correctness, testing, maintainability, adversarial
  • packages/app/src/components/ServiceMap/ServiceMapTooltip.tsx:80 — Four new rendering branches (the showMetrics guard, the requestsPerSecond row, the latencyMs percentile row, the Focus button) are entirely uncovered by component tests, so any regression in deriveDisplayMetrics flow-through goes undetected.
    • Fix: Add render tests covering latencyMs/requestsPerSecond present vs absent, totalRequests === 0 suppression, and onFocus present vs absent.
    • testing
  • packages/app/src/hooks/useServiceMap.tsx:144 — The new serviceNameFilter SQL fragment (including the IN-list interpolation via UNSAFE_RAW_SQL and the neighbor-expansion OR) has no test at the query-generation level; a future refactor could silently drop the filter or break neighbor expansion.
    • Fix: Add a unit test on getServiceMapQuery asserting the rendered SQL contains the escaped IN list and the OR-pattern when serviceNames is non-empty.
    • testing, performance, project-standards
🔵 P3 nitpicks (15)
  • packages/app/src/components/ServiceMap/utils.ts:117perSecond.toFixed(2) renders any rate below 0.005 as '0.00 req/s', indistinguishable from no traffic even though the same tooltip shows a non-zero request count.
    • Fix: Render '<0.01 req/s' for sub-0.01 rates, or switch the unit to req/min for low rates.
    • correctness
  • packages/app/src/hooks/useServiceMap.tsx:145 — Service names are passed through SqlString.escape (MySQL semantics) and interpolated via UNSAFE_RAW_SQL; the surrounding code uses the same pattern, but switching to a chSql { String: name } parameter eliminates the unsafe-raw hop entirely.
    • Fix: Build the IN list with chSql's typed string placeholder per element instead of SqlString.escape.
    • security, adversarial
  • packages/app/src/hooks/useServiceMap.tsx:145useGetKeyValues is configured with limit: 10000, disableRowLimit: true, so a URL or future "Select All" affordance can produce an IN list that overflows ClickHouse's default max_query_size (256 KB) and surfaces a confusing parse error.
    • Fix: Cap serviceNames.length defensively (e.g. 200) before building the IN list and warn on truncation.
    • adversarial
  • packages/app/src/components/ServiceMap/ServiceMap.tsx:40NODE_SIZE = 80 is the fixed dagre collision box, but rendered circles now scale 32–60px, so dagre allocates conservative padding that no longer matches the rendered geometry.
    • Fix: Pass per-node size into getGraphLayout, or add a comment noting the constant is intentionally a conservative upper bound.
    • maintainability
  • packages/app/src/components/ServiceMap/ServiceMap.tsx:136onFocusService is threaded through four component layers (page → ServiceMap → ServiceMapPresentation → ServiceMapNode) for a callback used only at the leaf.
    • Fix: Provide it via a React context scoped to the ReactFlowProvider subtree.
    • maintainability
  • packages/app/src/components/ServiceMap/utils.ts:130DisplayStats duplicates the field set of IncomingRequestStats and will silently drift when fields are added.
    • Fix: Replace DisplayStats with Pick<IncomingRequestStats, 'totalRequests' | 'p50' | 'p95' | 'p99' | 'hasLatency'>.
    • maintainability, kieran-typescript
  • packages/app/src/components/ServiceMap/ServiceMap.tsx:119 — The maxThroughput formula incomingRequests.totalRequests + outgoingRequests is repeated at the per-node getNodeSize call site, so the two must be kept in sync by hand.
    • Fix: Extract a totalThroughput(agg) helper into utils.ts and use it in both places.
    • kieran-typescript
  • packages/app/src/hooks/useServiceMap.tsx:350 — Several new as string / as unknown[] casts on the raw row push runtime trust onto the call site without a parse step.
    • Fix: Define a discriminated row type for GROUPING SETS output and parse via a schema (Zod) or a narrow guard helper.
    • project-standards, kieran-typescript
  • packages/app/src/hooks/useServiceMap.tsx:346data.data.map(...) throws an uninformative TypeError if ClickHouse returns a JSON body lacking the data array; the error surfaces but the message obscures the upstream cause.
    • Fix: Use (data.data ?? []).map(...) and surface data.exception when present.
    • reliability
  • packages/app/src/DBServiceMapPage.tsx:189handleSubmit(handler)() returns a Promise that is not awaited, so any form validation rejection is silently swallowed.
    • Fix: Make the onSearch callback async and await the returned promise.
    • kieran-typescript
  • packages/app/src/DBServiceMapPage.tsx:391, packages/app/src/hooks/useServiceMap.tsx:375, packages/app/src/components/ServiceMap/ServiceMap.tsx:324 — Three files now exceed the project's 300-line guideline; the page file's filter bar and the hook's getServiceMapQuery are natural extraction seams.
    • Fix: Extract DBServiceMapFilters, move getServiceMapQuery into its own module, and split ServiceMapPresentation into a sibling file.
    • project-standards
  • .changeset/service-map-percentiles-filtering.md:2 — The change bumps exported types (SpanAggregationRow, IncomingRequestStats, ServiceAggregation, ServiceMapEdgeData, ServiceMapNodeData) but is declared a patch.
    • Fix: Bump the changeset to minor to signal the exported-type shape changes.
    • api-contract
  • packages/app/src/DBServiceMapPage.tsx:105useForm({ values: { whereLanguage: searchedConfig.whereLanguage ?? getStoredLanguage() ?? 'lucene' } }) calls getStoredLanguage() on every render; a cross-tab localStorage flip would reset the in-progress form language under react-hook-form's values-sync semantics.
    • Fix: Resolve the initial language once via useMemo so the values prop is stable across renders.
    • adversarial
  • packages/app/src/components/ServiceMap/ServiceMapNode.tsx:53 — A two-line comment about the duration-precision fallback now sits at a call site where the fallback no longer lives (it was moved into deriveDisplayMetrics), and the precision it cites disagrees with the actual default.
    • Fix: Delete the stale comment.
    • maintainability
  • packages/api/src/mcp/mcpServer.ts — The new Service Map capabilities (Lucene/SQL where, service filter with neighbor expansion, percentile/throughput readout, Focus action) widen a pre-existing gap: no MCP tool exposes the topology query, so agents can only reconstruct it via hyperdx_sql.
    • Fix: File a follow-up to add a hyperdx_service_map MCP tool that wraps getServiceMapQuery; not blocking this PR.
    • agent-native

Reviewers (12): correctness, testing, maintainability, project-standards, agent-native, learnings, security, performance, api-contract, reliability, adversarial, kieran-typescript.

Testing gaps:

  • No assertion that the generated SQL contains the serviceName IN (...) filter or the OR neighbor-expansion pattern.
  • No coverage of ClickHouse returning NaN/null quantile elements (the path that produces 'NaNh'/'0ms' tooltips).
  • No render tests for ServiceMapTooltip, ServiceMapNode, or ServiceMapEdge covering the new percentile/throughput/Focus branches.
  • No round-trip test for the services, where, and whereLanguage URL query parameters (Lucene strings with spaces, +, quotes; comma-containing service names).
  • No coverage of the onFocusService three-way toggle behavior at the component or page level.

- rawDurationToMs: drop the exponent clamp so precision < 3 (e.g. seconds)
  scales up correctly, matching getDurationSecondsExpression (P2 correctness).
- Align durationPrecision fallback to the schema default (3) in node/edge.
- Type isNodeLevel as boolean, converting at the parse boundary.
- formatRate: guard non-finite/negative input.
- Tests: precision<3 conversion, formatRate guards, partial-percentile
  hasLatency case.
@MikeShi42
Copy link
Copy Markdown
Contributor Author

Thanks — addressed the deep-review feedback in c9be6f8.

Fixed:

  • P2 (correctness) rawDurationToMs clamp — dropped the Math.max(0, …) so a source with durationPrecision < 3 (e.g. seconds) scales up to ms correctly, matching getDurationSecondsExpression. Updated the test that locked in the old behavior.
  • P2 durationPrecision ?? 9 — aligned the fallback to the schema default (3) in both node and edge.
  • P3 isNodeLevel: number — now boolean, converted at the parse boundary so the SQL GROUPING() shape doesn't leak into the row type.
  • P3 formatRate — guards non-finite/negative input (returns 0 req/s).
  • Testing — added the statsFromRow partial-percentile case (p50 set, p95/p99 absent → hasLatency: true, zeros), the precision-<3 conversion, and formatRate guard cases.

Deferring (with reasons):

  • File splits (DBServiceMapPage.tsx ~391 lines, useServiceMap.tsx query+aggregator+hook) and the deriveTooltipMetrics dedup — legitimate; I'd rather do these as a focused follow-up (extract ServiceMapFilterBar/useServiceMapFilters, move getServiceMapQuery to its own module) than expand this PR's diff. Happy to do it here if you'd prefer.
  • Query-builder SQL test + tooltip/focus render tests — worth adding; planning them with the extraction above (a standalone serviceMapQuery.ts is much easier to unit-test). I did validate the GROUPING SETS query executes against a live ClickHouse.
  • IncomingRequestStats rename — reasonable, but it's referenced widely; folding into the refactor follow-up.
  • Redundant as 'sql' | 'lucene' / unreachable durationExpression guards — these are in the pre-existing filtering code and the reviewer flagged the latter as "not a blocker"; leaving to keep churn down.

…tion

Second-pass deep-review follow-ups:
- Use quantiles(0.5,0.95,0.99) (single reservoir sketch) instead of three
  quantile() calls; parser reads the [p50,p95,p99] array. Validated against CH.
- rawDurationToMs: multiply by 10^(3-precision) to avoid a fractional divisor
  for precision < 3.
- Extract deriveDisplayMetrics() shared by node and edge tooltips (removes the
  duplicated precision/latency/throughput derivation).
- parseInt with explicit radix 10; drop a redundant whereLanguage cast.
@MikeShi42
Copy link
Copy Markdown
Contributor Author

Second-pass review — addressed the bounded items in bd20dba:

Fixed:

  • Perf: quantile() ×3 → quantiles(0.5,0.95,0.99) (single reservoir sketch); parser reads the array. Validated the new SQL against a live ClickHouse.
  • rawDurationToMs: multiply by 10^(3-precision) to avoid the fractional-divisor drift for precision < 3.
  • Dedup: extracted deriveDisplayMetrics() shared by node + edge tooltips.
  • Nits: parseInt(…, 10) radix; dropped the redundant whereLanguage cast.

Deferring — these need a decision, not an auto-fix:

  • Focus scoping (P2, top item): correct catch — with a serviceNames filter, a neighbor's rolled-up (server) row only aggregates the focused slice, so neighbor node totals are scoped to the focus. This is inherent to the cursor branch's server-side filtering. The three fixes (drop predicate + client-side post-filter / move to HAVING / label "scoped to focus") have different product implications — I'd like a maintainer call before changing filter semantics. Leaning toward "surface scoped-to-focus in the tooltip" as least-surprising.
  • Empty serviceName(unknown) edge (P2): introduces a sentinel-node concept for a rare misconfigured-collector case; worth doing but it's a product decision.
  • useForm controlled reset discarding mid-typed where (P2): real; needs defaultValues + a guarded reset(..., { keepDirtyValues: true }) and its own test — deferring so it gets proper testing.

Batched into a focused follow-up (keeps this PR's diff reviewable): file splits (ServiceMapFilters component, serviceMapQuery.ts module — all 3 flagged files are over 300 lines), the query-builder/parser/focus-toggle/tooltip-render test suites, and the IncomingRequestStats rename.

Noted, not changing:

  • The "GROUPING SETS row-ordering invariant" testing-gap is a non-issue: node rows write incomingRequests and edge rows write incomingRequestsByClient (disjoint fields), and there's exactly one node row per service — order can't clobber totals.
  • UNSAFE_RAW_SQL+SqlString.escape, countIf('Error') schema assumption, layout fitView/pan-zoom preservation, MultiSelect-disabled-on-refetch, BigInt counts, JSDoc trimming — pre-existing-pattern or low-severity; capturing in the follow-up list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant