[multicast] connect MGD and DDM to Omicron #10346

Open
zeeshanlakhani wants to merge 1 commit into zl/multicast-m2p-forwarding from zl/multicast-mgd-ddm

@zeeshanlakhani zeeshanlakhani commented Apr 30, 2026

Wires MGD (MRIB programming) and DDM (live peer topology for sled-to-switch-port resolution) into the multicast reconciler RPW. The reconciler resolves sled-to-port mapping via DDM peers (primary, live source) and falls back to inventory + DPD backplane when DDM is unavailable. MRIB routes are advertised through MGD and withdrawn when no "Joined" members remain.
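
The advertise/withdraw rule can be sketched as a small decision function. This is an illustration only: `MemberState`, `MribAction`, and `mrib_action` are hypothetical names, and `Joined` is the only member state the description actually names.

```rust
/// Hypothetical member states; only `Joined` appears in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemberState {
    Joining,
    Joined,
    Left,
}

/// What the reconciler would do with a group's MRIB route on this pass.
#[derive(Debug, PartialEq, Eq)]
pub enum MribAction {
    Advertise,
    Withdraw,
    NoChange,
}

/// Advertise when at least one member is `Joined` and no route exists yet;
/// withdraw when a route exists but no `Joined` member remains.
pub fn mrib_action(members: &[MemberState], advertised: bool) -> MribAction {
    let any_joined = members.iter().any(|m| *m == MemberState::Joined);
    match (any_joined, advertised) {
        (true, false) => MribAction::Advertise,
        (false, true) => MribAction::Withdraw,
        _ => MribAction::NoChange,
    }
}
```

Keeping the decision a pure function of member states and the current route makes repeated reconciler passes idempotent.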

Under the planned migration of system-level networking from Nexus RPWs to sled-agent reconcilers (omicron#10167), multicast is classified as instance networking.

Sled-side underlay NIC filter programming

  • set_mcast_m2p / clear_mcast_m2p in the OPTE port manager hold UDP sockets joined to the underlay multicast group on each underlay NIC. Joining the group on a held socket triggers mac_multicast_add in the kernel, which programs the per-NIC multicast MAC filter so cxgbe delivers frames to xde. This is a workaround for opte#908.
  • Eager rehydration at sled-agent startup reopens those filter sockets for M2P entries that survive in xde across a restart. Rehydration failures clear the surviving M2P entry so convergence retries on the next pass instead of black-holing the group.
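
A minimal sketch of the held-socket bookkeeping described above, not the port manager's actual code. `FilterSockets` and its methods are hypothetical. The socket type is generic so the logic can be exercised without a NIC; with a real `std::net::UdpSocket`, the `open` callback would call `join_multicast_v6`, and dropping the socket would leave the group.

```rust
use std::collections::HashMap;
use std::io;
use std::net::Ipv6Addr;

/// Holds one open handle per underlay multicast group (hypothetical type).
/// `S` stands in for the socket that keeps the NIC's MAC filter programmed.
pub struct FilterSockets<S> {
    held: HashMap<Ipv6Addr, S>,
}

impl<S> FilterSockets<S> {
    pub fn new() -> Self {
        Self { held: HashMap::new() }
    }

    /// Open and hold a socket for `group` unless one is already held.
    pub fn set(
        &mut self,
        group: Ipv6Addr,
        open: impl FnOnce(Ipv6Addr) -> io::Result<S>,
    ) -> io::Result<()> {
        if !self.held.contains_key(&group) {
            let sock = open(group)?;
            self.held.insert(group, sock);
        }
        Ok(())
    }

    /// Drop the held socket for `group`; returns true if one was held.
    pub fn clear(&mut self, group: &Ipv6Addr) -> bool {
        self.held.remove(group).is_some()
    }

    pub fn is_held(&self, group: &Ipv6Addr) -> bool {
        self.held.contains_key(group)
    }

    /// Reopen sockets for M2P entries that survived a restart. Returns the
    /// groups whose reopen failed, so the caller can clear those M2P entries
    /// and let the next reconciler pass retry.
    pub fn rehydrate(
        &mut self,
        surviving: &[Ipv6Addr],
        mut open: impl FnMut(Ipv6Addr) -> io::Result<S>,
    ) -> Vec<Ipv6Addr> {
        let mut failed = Vec::new();
        for group in surviving {
            if self.set(*group, &mut open).is_err() {
                failed.push(*group);
            }
        }
        failed
    }
}
```

Returning the failed groups, rather than aborting on the first error, mirrors the stated goal: one bad group is cleared and retried instead of being left in place without a working filter.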

Switch-zone integration

  • New MulticastSwitchZoneClient fans out per-switch MGD and DDM clients, discovered via internal DNS SRV records. The reconciler uses it for MRIB writes and live peer queries (consuming the ddm-admin-client GET /peers endpoint that returns if_name / port info per peer).
  • ServiceName::Ddm registered in internal DNS via host_zone_switch (now takes a ddm_port) so cross-sled consumers can discover ddmd in switch zones. RSS, the test starter, and overridables_for_test thread the new port through. The multicast reconciler is the first cross-sled consumer; previously, all DdmAdminClient callers were sled-local via DdmAdminClient::localhost.
  • Resolver helper preserves SRV target names alongside resolved sockets, enabling per-target correlation when multiple switch zones share an address but differ by port.
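
To illustrate why keeping the SRV target name matters, the sketch below pairs MGD and DDM endpoints by target rather than by address. `ResolvedTarget`, `SwitchEndpoints`, and `correlate` are hypothetical names, and the real resolver helper may expose a different shape.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;

/// One resolved SRV record: the target name plus the socket it resolved to.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ResolvedTarget {
    pub target: String,
    pub addr: SocketAddr,
}

/// The per-switch endpoint pair a fan-out client would hold (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct SwitchEndpoints {
    pub mgd: SocketAddr,
    pub ddm: SocketAddr,
}

/// Pair MGD and DDM endpoints that share an SRV target name. Targets that
/// appear in only one of the two lists are skipped.
pub fn correlate(
    mgd: &[ResolvedTarget],
    ddm: &[ResolvedTarget],
) -> BTreeMap<String, SwitchEndpoints> {
    let ddm_by_target: BTreeMap<&str, SocketAddr> =
        ddm.iter().map(|t| (t.target.as_str(), t.addr)).collect();
    mgd.iter()
        .filter_map(|m| {
            ddm_by_target.get(m.target.as_str()).map(|d| {
                (m.target.clone(), SwitchEndpoints { mgd: m.addr, ddm: *d })
            })
        })
        .collect()
}
```

Keying on the target name keeps the pairing unambiguous even when two switch zones resolve to the same IP address and differ only by port.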

Note: the first reconciler pass after upgrade publishes one new _ddm._tcp SRV record per switch zone, causing a one-time DNS generation bump.

Instance-scoped multicast subscriptions

  • v36 (VERSION_MCAST_M2P_FORWARDING) introduces PUT/DELETE /instances/{instance_id}/multicast-group, replacing the earlier VMM-keyed /vmms/{propolis_id}/multicast-group shape. Sled-agent resolves the active VMM under its instance-state lock and dispatches to OPTE atomically, eliminating a Nexus-side lookup-vs-call race where a migration commit could land subscriptions on a stale propolis.
  • v7 endpoints remain on the trait as deprecated shims that perform the propolis-to-instance lookup and delegate to the new handler.
  • Nexus drops cached_propolis_id and lookup_propolis_id plumbing through the reconciler entirely. subscribe_vmm / unsubscribe_vmm become subscribe_instance / unsubscribe_instance.
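
The race fix can be pictured as doing the VMM lookup and the subscription in one critical section. This is a simplified model under stated assumptions: `Instance`, `InstanceState`, and the integer VMM IDs are hypothetical stand-ins, and the real handler dispatches to OPTE rather than updating a set.

```rust
use std::collections::BTreeSet;
use std::net::IpAddr;
use std::sync::Mutex;

/// Hypothetical per-instance state guarded by a single lock.
#[derive(Debug, Default)]
pub struct InstanceState {
    /// ID of the VMM currently backing the instance, if any.
    pub active_vmm: Option<u64>,
    /// Groups subscribed on behalf of the instance.
    pub subscriptions: BTreeSet<IpAddr>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubscribeError {
    NoActiveVmm,
}

pub struct Instance {
    state: Mutex<InstanceState>,
}

impl Instance {
    pub fn new(active_vmm: Option<u64>) -> Self {
        let state = InstanceState { active_vmm, ..Default::default() };
        Self { state: Mutex::new(state) }
    }

    /// Resolve the active VMM and record the subscription while holding the
    /// lock, so a migration commit cannot slip in between the two steps.
    /// Returns the VMM the subscription was applied to.
    pub fn subscribe(&self, group: IpAddr) -> Result<u64, SubscribeError> {
        let mut st = self.state.lock().unwrap();
        let vmm = st.active_vmm.ok_or(SubscribeError::NoActiveVmm)?;
        st.subscriptions.insert(group);
        Ok(vmm)
    }

    /// A migration commit swaps the active VMM under the same lock.
    pub fn commit_migration(&self, new_vmm: u64) {
        self.state.lock().unwrap().active_vmm = Some(new_vmm);
    }
}
```

Because the caller names only the instance, there is no stale propolis ID to carry across the call; whichever VMM is active when the lock is taken receives the subscription.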

Per-pass sled-to-port resolution

Delivers the design captured in the prior TODO: prefer DDM's authoritative view of sled-to-port reachability over inventory, with inventory as cross-validation rather than the primary input.

  • Replaces the previous TTL'd sled-mapping cache with a single-pass amortization built once at the top of the member reconciler pass and threaded through the per-pass reconciler context.
  • DDM peer topology is the primary source. Inventory + DPD backplane is the fallback and supplements partial DDM coverage (per-sled gap-fill) rather than being all-or-nothing.
  • Parsed peer port IDs are cross-validated against the DPD backplane map.
  • Sequential per-switch fallback for shared-state DPD reads (backplane map, underlay group fetch), so a single unhealthy switch can't fail the whole read.
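
The resolution order above can be sketched as follows. All names here (`resolve_ports`, `Source`, the `SledId` and `PortId` aliases) are hypothetical. Falling back to inventory when a DDM-reported port is missing from the backplane map is an assumption of this sketch, since the description does not say how validation failures are handled.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type SledId = u32;
pub type PortId = String;

/// Where a sled's port mapping came from.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Source {
    Ddm,
    Inventory,
}

/// Build the per-pass sled-to-port map once: DDM first, inventory as a
/// per-sled gap-fill, and every candidate checked against the backplane map.
pub fn resolve_ports(
    sleds: &[SledId],
    ddm: &BTreeMap<SledId, PortId>,
    inventory: &BTreeMap<SledId, PortId>,
    backplane: &BTreeSet<PortId>,
) -> BTreeMap<SledId, (PortId, Source)> {
    let mut out = BTreeMap::new();
    for sled in sleds {
        // Primary source: a DDM peer whose port the backplane map knows.
        let from_ddm = ddm.get(sled).filter(|p| backplane.contains(*p));
        let picked = match from_ddm {
            Some(p) => Some((p.clone(), Source::Ddm)),
            // Gap-fill: only this sled falls back, not the whole pass.
            None => inventory
                .get(sled)
                .filter(|p| backplane.contains(*p))
                .map(|p| (p.clone(), Source::Inventory)),
        };
        if let Some(entry) = picked {
            out.insert(*sled, entry);
        }
    }
    out
}
```

Sleds that neither source can place are simply absent from the result, leaving them for a later pass to resolve.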

Saga and RPW interaction

  • Saga state guard widened: the DPD-ensure saga accepts "Active" as well as "Creating" so crash-recovery re-execution doesn't roll back already-applied DPD state.
  • instance_stop detaches multicast members and activates the reconciler only after sled-agent acknowledges the Stop request, avoiding M2P / forwarding teardown for a still-running guest if Stop fails.
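
The widened saga guard amounts to accepting one more state. A minimal sketch, assuming a hypothetical `GroupState` enum in which `Deleting` stands in for any state the guard still rejects:

```rust
/// Hypothetical group states; the description names only "Creating" and
/// "Active". `Deleting` is an illustrative rejected state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GroupState {
    Creating,
    Active,
    Deleting,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GuardOutcome {
    Proceed,
    Reject,
}

/// Guard for the DPD-ensure step. Accepting `Active` lets a saga that is
/// re-executed after a crash proceed instead of unwinding applied state.
pub fn dpd_ensure_guard(state: GroupState) -> GuardOutcome {
    match state {
        GroupState::Creating | GroupState::Active => GuardOutcome::Proceed,
        GroupState::Deleting => GuardOutcome::Reject,
    }
}
```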

Test updates

  • Integration coverage for MRIB programming, DDM-vs-inventory drift, saga idempotent crash-recovery, per-switch invariant checks, and underlay MAC filter lifecycle.
  • New populate_ddm_peers test helper synthesizes DDM peer topology from datastore + inventory so tests exercise the production primary path instead of the inventory fallback that an empty DdmInstance would otherwise force. The cache is keyed on the in-service sled set, so multi-sled fixtures rebuild on sled transitions.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch 5 times, most recently from ca48d80 to 15e64aa Compare May 1, 2026 15:59
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch from 15e64aa to 92d912b Compare May 1, 2026 16:06
@zeeshanlakhani zeeshanlakhani self-assigned this May 2, 2026
@zeeshanlakhani zeeshanlakhani requested a review from jgallagher May 2, 2026 08:16