feat(service-map): server-side filtering, latency percentiles, throughput & focus by MikeShi42 · Pull Request #2387 · hyperdxio/hyperdx

MikeShi42 · 2026-05-30T18:32:44Z

What & why

The Service Map showed topology plus only part of RED (request counts + error
rate). This brings it closer to a full APM-style service map: filter large maps
down to what you care about, see latency (the missing D in RED) and
throughput, and pivot/inspect a service.

Commit 1 — Server-side filtering

A Lucene/SQL where input and a service-name MultiSelect scope which
spans/services the map is built from, with inbound/outbound neighbor
expansion so a focused service still shows its immediate dependencies.

Commit 2 — Latency percentiles, throughput, focus & node sizing

p50 / p95 / p99 latency per node and per edge, computed server-side via a
single GROUPING SETS query that emits both rolled-up node-level rows and
per-edge rows (percentiles can't be combined client-side). Errors now come
from countIf, dropping the old per-status map.
Throughput (req/s), labeled, shown alongside percentiles in tooltips.
A Focus action in the node tooltip drives the service filter to that
service and its immediate neighbors (clicking again clears it). Clicking a
node just selects it / opens the tooltip — it does not auto-focus.
Node size scales with total throughput (incoming + outgoing requests).

Testing

68 unit tests covering the aggregation and the pure helpers
(rawDurationToMs, getRequestsPerSecond, formatRate, getNodeSize).
tsc, eslint, and stylelint are clean.
The GROUPING SETS query was executed against a live ClickHouse to confirm it
parses/runs (GROUPING(), countIf, quantile).
Not yet visually validated against real trace data.

Notes for reviewers

Edge latency is the server-side processing time of calls from that caller
(server spans grouped by caller), not client-observed round-trip latency.
Percentiles are approximate (computed over sampled spans with approximate
quantiles) and are marked with a ~ in the UI.
DBServiceMapPage.tsx is now ~390 lines; extracting the filter bar into its
own component is a reasonable follow-up.

🤖 Generated with Claude Code

…ce MultiSelect + neighbor expansion)

…de sizing Builds on the server-side filtering to make the map a real RED-metrics view: - Latency p50/p95/p99 computed server-side via a single GROUPING SETS query that emits both rolled-up node-level rows and per-edge rows (percentiles can't be combined client-side). Errors via countIf; drops the per-status map. - Labeled request throughput (req/s) shown alongside percentiles in node and edge tooltips. - 'Focus' action in the node tooltip drives the server-side service filter to that service and its immediate neighbors (toggles off when reselected); clicking a node selects it and reveals the tooltip rather than auto-focusing. - Node size scales with throughput (sqrt of incoming volume vs the busiest node). - Pure unit-tested helpers: rawDurationToMs, getRequestsPerSecond, formatRate, getNodeSize; aggregation tests rewritten for the new row model.

changeset-bot · 2026-05-30T18:32:47Z

🦋 Changeset detected

Latest commit: bd20dba

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@hyperdx/app	Patch
@hyperdx/api	Patch
@hyperdx/otel-collector	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel · 2026-05-30T18:32:49Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	May 30, 2026 7:58pm
hyperdx-storybook	Ready	Preview, Comment	May 30, 2026 7:58pm

github-actions · 2026-05-30T18:33:07Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Large diff: 790 production lines changed (threshold: 400)

Additional context: agent branch (claude/service-map-integrated)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 7
Production lines changed: 790 (+ 758 in test files, excluded from tier calculation)
Branch: claude/service-map-integrated
Author: MikeShi42

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-05-30T18:40:08Z

E2E Test Results

✅ All tests passed • 192 passed • 3 skipped • 1321s

Status	Count
✅ Passed	192
❌ Failed	0
⚠️ Flaky	3
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

github-actions · 2026-05-30T19:14:40Z

Deep Review

✅ No critical issues found.

🟡 P2 — recommended

packages/app/src/hooks/useServiceMap.tsx:237 — hasLatency: row.p50 != null lets NaN through because Number('nan') is NaN and NaN != null, so when ClickHouse quantiles() returns 'nan' for an empty group the tooltip will render p50 ~NaNh · p95 ~NaNh · p99 ~NaNh.
- Fix: Require Number.isFinite(row.p50) for hasLatency, and convert non-finite array entries to undefined when reading quantiles[i].
- _{correctness, reliability, adversarial, kieran-typescript}
packages/app/src/components/ServiceMap/utils.ts:155 — source.durationPrecision ?? 3 disagrees with the codebase default of 9 used in source.ts and serviceDashboard.ts; if a source ever reaches this path without durationPrecision, raw nanoseconds are treated as milliseconds and every latency renders 1,000,000× too large.
- Fix: Change the fallback to ?? 9 to match the established default, or fail loudly if the field is genuinely required upstream.
- _{maintainability}
packages/app/src/hooks/useServiceMap.tsx:144 — With server-side service filtering, the WHERE (ServerSpans.serviceName IN (...) OR ClientSpans.serviceName IN (...)) runs before GROUPING SETS, so a neighbor node's rolled-up tooltip shows totals/percentiles scoped to the focused service only — labeled identically to its full-graph stats.
- Fix: Disambiguate the tooltip when a service filter is active (label the rollup as scoped) or suppress the per-neighbor node tooltip and only show the per-edge view.
- _{correctness, adversarial}
packages/app/src/DBServiceMapPage.tsx:78 — parseAsArrayOf(parseAsString) joins on , by default, so service names containing a comma round-trip as two separate entries; Focus on such a service writes a URL that reloads as a multi-service filter for non-existent names, and the map shows empty.
- Fix: Configure parseAsArrayOf with a separator illegal in service names (e.g. |), or encode each element.
- _adversarial
packages/app/src/components/ServiceMap/ServiceMap.tsx:125 — The layout useEffect depends on services (a fresh Map per react-query success), onFocusService, and other identity-unstable values, so dagre re-runs and the companion fitView effect re-fits the camera every time those identities flip — causing visible camera jumps on filter change even when topology is unchanged.
- Fix: Memoize the rebuilt nodes/edges by a content hash of the services map and skip the layout call when nothing changed; read onFocusService via a ref to keep it out of the dep array.
- _{performance, adversarial}
packages/app/src/hooks/useServiceMap.tsx:147 — The serviceName filter is applied in the outer WHERE, after the CTEs have already scanned the full time-range; pushing the IN-list into baseCTEConfig.filters would let ClickHouse prune partitions and skip rows at scan time.
- Fix: Add the serverServiceName IN (...) predicate inside baseCTEConfig.filters (keeping the OR-based neighbor expansion in the outer WHERE).
- _performance
packages/app/src/DBServiceMapPage.tsx:178 — The new onFocusService toggle (three branches: focus unfocused, re-click focused → clear, click while multi-selected → collapse) has no unit or component test, and the toggle silently destroys a prior multi-select selection with no way to recover it.
- Fix: Extract the toggle into a pure helper and unit-test the three branches; consider preserving prior selection or disabling Focus when more than one service is selected.
- _{correctness, testing, maintainability, adversarial}
packages/app/src/components/ServiceMap/ServiceMapTooltip.tsx:80 — Four new rendering branches (the showMetrics guard, the requestsPerSecond row, the latencyMs percentile row, the Focus button) are entirely uncovered by component tests, so any regression in deriveDisplayMetrics flow-through goes undetected.
- Fix: Add render tests covering latencyMs/requestsPerSecond present vs absent, totalRequests === 0 suppression, and onFocus present vs absent.
- _testing
packages/app/src/hooks/useServiceMap.tsx:144 — The new serviceNameFilter SQL fragment (including the IN-list interpolation via UNSAFE_RAW_SQL and the neighbor-expansion OR) has no test at the query-generation level; a future refactor could silently drop the filter or break neighbor expansion.
- Fix: Add a unit test on getServiceMapQuery asserting the rendered SQL contains the escaped IN list and the OR-pattern when serviceNames is non-empty.
- _{testing, performance, project-standards}

🔵 P3 nitpicks (15)

packages/app/src/components/ServiceMap/utils.ts:117 — perSecond.toFixed(2) renders any rate below 0.005 as '0.00 req/s', indistinguishable from no traffic even though the same tooltip shows a non-zero request count.
- Fix: Render '<0.01 req/s' for sub-0.01 rates, or switch the unit to req/min for low rates.
- _correctness
packages/app/src/hooks/useServiceMap.tsx:145 — Service names are passed through SqlString.escape (MySQL semantics) and interpolated via UNSAFE_RAW_SQL; the surrounding code uses the same pattern, but switching to a chSql { String: name } parameter eliminates the unsafe-raw hop entirely.
- Fix: Build the IN list with chSql's typed string placeholder per element instead of SqlString.escape.
- _{security, adversarial}
packages/app/src/hooks/useServiceMap.tsx:145 — useGetKeyValues is configured with limit: 10000, disableRowLimit: true, so a URL or future "Select All" affordance can produce an IN list that overflows ClickHouse's default max_query_size (256 KB) and surfaces a confusing parse error.
- Fix: Cap serviceNames.length defensively (e.g. 200) before building the IN list and warn on truncation.
- _adversarial
packages/app/src/components/ServiceMap/ServiceMap.tsx:40 — NODE_SIZE = 80 is the fixed dagre collision box, but rendered circles now scale 32–60px, so dagre allocates conservative padding that no longer matches the rendered geometry.
- Fix: Pass per-node size into getGraphLayout, or add a comment noting the constant is intentionally a conservative upper bound.
- _{maintainability}
packages/app/src/components/ServiceMap/ServiceMap.tsx:136 — onFocusService is threaded through four component layers (page → ServiceMap → ServiceMapPresentation → ServiceMapNode) for a callback used only at the leaf.
- Fix: Provide it via a React context scoped to the ReactFlowProvider subtree.
- _{maintainability}
packages/app/src/components/ServiceMap/utils.ts:130 — DisplayStats duplicates the field set of IncomingRequestStats and will silently drift when fields are added.
- Fix: Replace DisplayStats with Pick<IncomingRequestStats, 'totalRequests' | 'p50' | 'p95' | 'p99' | 'hasLatency'>.
- _{maintainability, kieran-typescript}
packages/app/src/components/ServiceMap/ServiceMap.tsx:119 — The maxThroughput formula incomingRequests.totalRequests + outgoingRequests is repeated at the per-node getNodeSize call site, so the two must be kept in sync by hand.
- Fix: Extract a totalThroughput(agg) helper into utils.ts and use it in both places.
- _{kieran-typescript}
packages/app/src/hooks/useServiceMap.tsx:350 — Several new as string / as unknown[] casts on the raw row push runtime trust onto the call site without a parse step.
- Fix: Define a discriminated row type for GROUPING SETS output and parse via a schema (Zod) or a narrow guard helper.
- _{project-standards, kieran-typescript}
packages/app/src/hooks/useServiceMap.tsx:346 — data.data.map(...) throws an uninformative TypeError if ClickHouse returns a JSON body lacking the data array; the error surfaces but the message obscures the upstream cause.
- Fix: Use (data.data ?? []).map(...) and surface data.exception when present.
- _reliability
packages/app/src/DBServiceMapPage.tsx:189 — handleSubmit(handler)() returns a Promise that is not awaited, so any form validation rejection is silently swallowed.
- Fix: Make the onSearch callback async and await the returned promise.
- _{kieran-typescript}
packages/app/src/DBServiceMapPage.tsx:391, packages/app/src/hooks/useServiceMap.tsx:375, packages/app/src/components/ServiceMap/ServiceMap.tsx:324 — Three files now exceed the project's 300-line guideline; the page file's filter bar and the hook's getServiceMapQuery are natural extraction seams.
- Fix: Extract DBServiceMapFilters, move getServiceMapQuery into its own module, and split ServiceMapPresentation into a sibling file.
- _{project-standards}
.changeset/service-map-percentiles-filtering.md:2 — The change bumps exported types (SpanAggregationRow, IncomingRequestStats, ServiceAggregation, ServiceMapEdgeData, ServiceMapNodeData) but is declared a patch.
- Fix: Bump the changeset to minor to signal the exported-type shape changes.
- _api-contract
packages/app/src/DBServiceMapPage.tsx:105 — useForm({ values: { whereLanguage: searchedConfig.whereLanguage ?? getStoredLanguage() ?? 'lucene' } }) calls getStoredLanguage() on every render; a cross-tab localStorage flip would reset the in-progress form language under react-hook-form's values-sync semantics.
- Fix: Resolve the initial language once via useMemo so the values prop is stable across renders.
- _adversarial
packages/app/src/components/ServiceMap/ServiceMapNode.tsx:53 — A two-line comment about the duration-precision fallback now sits at a call site where the fallback no longer lives (it was moved into deriveDisplayMetrics), and the precision it cites disagrees with the actual default.
- Fix: Delete the stale comment.
- _{maintainability}
packages/api/src/mcp/mcpServer.ts — The new Service Map capabilities (Lucene/SQL where, service filter with neighbor expansion, percentile/throughput readout, Focus action) widen a pre-existing gap: no MCP tool exposes the topology query, so agents can only reconstruct it via hyperdx_sql.
- Fix: File a follow-up to add a hyperdx_service_map MCP tool that wraps getServiceMapQuery; not blocking this PR.
- _agent-native

Reviewers (12): correctness, testing, maintainability, project-standards, agent-native, learnings, security, performance, api-contract, reliability, adversarial, kieran-typescript.

Testing gaps:

No assertion that the generated SQL contains the serviceName IN (...) filter or the OR neighbor-expansion pattern.
No coverage of ClickHouse returning NaN/null quantile elements (the path that produces 'NaNh'/'0ms' tooltips).
No render tests for ServiceMapTooltip, ServiceMapNode, or ServiceMapEdge covering the new percentile/throughput/Focus branches.
No round-trip test for the services, where, and whereLanguage URL query parameters (Lucene strings with spaces, +, quotes; comma-containing service names).
No coverage of the onFocusService three-way toggle behavior at the component or page level.

- rawDurationToMs: drop the exponent clamp so precision < 3 (e.g. seconds) scales up correctly, matching getDurationSecondsExpression (P2 correctness). - Align durationPrecision fallback to the schema default (3) in node/edge. - Type isNodeLevel as boolean, converting at the parse boundary. - formatRate: guard non-finite/negative input. - Tests: precision<3 conversion, formatRate guards, partial-percentile hasLatency case.

MikeShi42 · 2026-05-30T19:28:07Z

Thanks — addressed the deep-review feedback in c9be6f8.

Fixed:

P2 (correctness) rawDurationToMs clamp — dropped the Math.max(0, …) so a source with durationPrecision < 3 (e.g. seconds) scales up to ms correctly, matching getDurationSecondsExpression. Updated the test that locked in the old behavior.
P2 durationPrecision ?? 9 — aligned the fallback to the schema default (3) in both node and edge.
P3 isNodeLevel: number — now boolean, converted at the parse boundary so the SQL GROUPING() shape doesn't leak into the row type.
P3 formatRate — guards non-finite/negative input (returns 0 req/s).
Testing — added the statsFromRow partial-percentile case (p50 set, p95/p99 absent → hasLatency: true, zeros), the precision-<3 conversion, and formatRate guard cases.

Deferring (with reasons):

File splits (DBServiceMapPage.tsx ~391 lines, useServiceMap.tsx query+aggregator+hook) and the deriveTooltipMetrics dedup — legitimate; I'd rather do these as a focused follow-up (extract ServiceMapFilterBar/useServiceMapFilters, move getServiceMapQuery to its own module) than expand this PR's diff. Happy to do it here if you'd prefer.
Query-builder SQL test + tooltip/focus render tests — worth adding; planning them with the extraction above (a standalone serviceMapQuery.ts is much easier to unit-test). I did validate the GROUPING SETS query executes against a live ClickHouse.
IncomingRequestStats rename — reasonable, but it's referenced widely; folding into the refactor follow-up.
Redundant as 'sql' | 'lucene' / unreachable durationExpression guards — these are in the pre-existing filtering code and the reviewer flagged the latter as "not a blocker"; leaving to keep churn down.

…tion Second-pass deep-review follow-ups: - Use quantiles(0.5,0.95,0.99) (single reservoir sketch) instead of three quantile() calls; parser reads the [p50,p95,p99] array. Validated against CH. - rawDurationToMs: multiply by 10^(3-precision) to avoid a fractional divisor for precision < 3. - Extract deriveDisplayMetrics() shared by node and edge tooltips (removes the duplicated precision/latency/throughput derivation). - parseInt with explicit radix 10; drop a redundant whereLanguage cast.

MikeShi42 · 2026-05-30T19:56:11Z

Second-pass review — addressed the bounded items in bd20dba:

Fixed:

Perf: quantile() ×3 → quantiles(0.5,0.95,0.99) (single reservoir sketch); parser reads the array. Validated the new SQL against a live ClickHouse.
rawDurationToMs: multiply by 10^(3-precision) to avoid the fractional-divisor drift for precision < 3.
Dedup: extracted deriveDisplayMetrics() shared by node + edge tooltips.
Nits: parseInt(…, 10) radix; dropped the redundant whereLanguage cast.

Deferring — these need a decision, not an auto-fix:

Focus scoping (P2, top item): correct catch — with a serviceNames filter, a neighbor's rolled-up (server) row only aggregates the focused slice, so neighbor node totals are scoped to the focus. This is inherent to the cursor branch's server-side filtering. The three fixes (drop predicate + client-side post-filter / move to HAVING / label "scoped to focus") have different product implications — I'd like a maintainer call before changing filter semantics. Leaning toward "surface scoped-to-focus in the tooltip" as least-surprising.
Empty serviceName → (unknown) edge (P2): introduces a sentinel-node concept for a rare misconfigured-collector case; worth doing but it's a product decision.
useForm controlled reset discarding mid-typed where (P2): real; needs defaultValues + a guarded reset(..., { keepDirtyValues: true }) and its own test — deferring so it gets proper testing.

Batched into a focused follow-up (keeps this PR's diff reviewable): file splits (ServiceMapFilters component, serviceMapQuery.ts module — all 3 flagged files are over 300 lines), the query-builder/parser/focus-toggle/tooltip-render test suites, and the IncomingRequestStats rename.

Noted, not changing:

The "GROUPING SETS row-ordering invariant" testing-gap is a non-issue: node rows write incomingRequests and edge rows write incomingRequestsByClient (disjoint fields), and there's exactly one node row per service — order can't clobber totals.
UNSAFE_RAW_SQL+SqlString.escape, countIf('Error') schema assumption, layout fitView/pan-zoom preservation, MultiSelect-disabled-on-refetch, BigInt counts, JSDoc trimming — pre-existing-pattern or low-severity; capturing in the follow-up list.

MikeShi42 added 2 commits May 30, 2026 18:30

feat: server-side filtering for Service Map (Lucene/SQL where + servi…

7fa4839

…ce MultiSelect + neighbor expansion)

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 30, 2026

vercel Bot deployed to Preview – hyperdx-storybook May 30, 2026 18:34 View deployment

vercel Bot deployed to Preview – hyperdx-oss May 30, 2026 18:35 View deployment

chore(service-map): add changeset

66d1b3e

vercel Bot deployed to Preview – hyperdx-storybook May 30, 2026 19:29 View deployment

vercel Bot deployed to Preview – hyperdx-oss May 30, 2026 19:31 View deployment

vercel Bot deployed to Preview – hyperdx-storybook May 30, 2026 19:57 View deployment

vercel Bot deployed to Preview – hyperdx-oss May 30, 2026 19:58 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(service-map): server-side filtering, latency percentiles, throughput & focus#2387

feat(service-map): server-side filtering, latency percentiles, throughput & focus#2387
MikeShi42 wants to merge 5 commits into
mainfrom
claude/service-map-integrated

MikeShi42 commented May 30, 2026

Uh oh!

changeset-bot Bot commented May 30, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 30, 2026 •

edited

Loading

Uh oh!

MikeShi42 commented May 30, 2026

Uh oh!

MikeShi42 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MikeShi42 commented May 30, 2026

What & why

Commit 1 — Server-side filtering

Commit 2 — Latency percentiles, throughput, focus & node sizing

Testing

Notes for reviewers

Uh oh!

changeset-bot Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

vercel Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Tier 4 — Critical

Uh oh!

github-actions Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

github-actions Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🟡 P2 — recommended

Uh oh!

MikeShi42 commented May 30, 2026

Uh oh!

MikeShi42 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented May 30, 2026 •

edited

Loading

vercel Bot commented May 30, 2026 •

edited

Loading

github-actions Bot commented May 30, 2026 •

edited

Loading

github-actions Bot commented May 30, 2026 •

edited

Loading

github-actions Bot commented May 30, 2026 •

edited

Loading