[multicast] connect MGD and DDM to Omicron #10346
Open
zeeshanlakhani wants to merge 1 commit into zl/multicast-m2p-forwarding
Wires MGD (MRIB programming) and DDM (live peer topology for sled-to-switch-port resolution) into the multicast reconciler RPW. The reconciler resolves sled-to-port mapping via DDM peers (primary, live source) and falls back to inventory + DPD backplane when DDM is unavailable. MRIB routes are advertised through MGD and withdrawn when no "Joined" members remain.

Multicast is *instance networking* under the planned migration of system-level networking from Nexus RPWs to sled-agent reconcilers ([omicron#10167](#10167)).

### Sled-side underlay NIC filter programming

- `set_mcast_m2p` / `clear_mcast_m2p` in the OPTE port manager hold UDP sockets joined to the underlay multicast group on each underlay NIC. Joining the group on a held socket triggers `mac_multicast_add` in the kernel, which programs the per-NIC multicast MAC filter so cxgbe delivers frames to xde. Workaround for opte#908.
- Eager rehydration at sled-agent startup reopens those filter sockets for M2P entries that survive in xde across a restart. Rehydration failures clear the surviving M2P entry so convergence retries on the next pass instead of black-holing the group.

### Switch-zone integration

- New `MulticastSwitchZoneClient` fans out per-switch MGD and DDM clients, discovered via internal DNS SRV records. The reconciler uses it for MRIB writes and live peer queries (consuming the ddm-admin-client `GET /peers` endpoint that returns `if_name` / port info per peer).
- `ServiceName::Ddm` registered in internal DNS via `host_zone_switch` (now takes a `ddm_port`) so cross-sled consumers can discover `ddmd` in switch zones. RSS, the test starter, and `overridables_for_test` thread the new port through. The multicast reconciler is the first cross-sled consumer; previously, all `DdmAdminClient` callers were sled-local via `DdmAdminClient::localhost`.
- Resolver helper preserves SRV target names alongside resolved sockets, enabling per-target correlation when multiple switch zones share an address but differ by port.

*Note*: the first reconciler pass after upgrade publishes one new `_ddm._tcp` SRV record per switch zone, causing a one-time DNS generation bump.

### Instance-scoped multicast subscriptions

- v36 (`VERSION_MCAST_M2P_FORWARDING`) introduces `PUT/DELETE /instances/{instance_id}/multicast-group`, replacing the earlier VMM-keyed `/vmms/{propolis_id}/multicast-group` shape. Sled-agent resolves the active VMM under its instance-state lock and dispatches to OPTE atomically, eliminating a Nexus-side lookup-vs-call race where a migration commit could land subscriptions on a stale propolis.
- v7 endpoints remain on the trait as deprecated shims that perform the propolis-to-instance lookup and delegate to the new handler.
- Nexus drops `cached_propolis_id` and `lookup_propolis_id` plumbing through the reconciler entirely. `subscribe_vmm` / `unsubscribe_vmm` become `subscribe_instance` / `unsubscribe_instance`.

### Per-pass sled-to-port resolution

Delivers the design captured in the prior TODO: prefer DDM's authoritative view of sled-to-port reachability over inventory, with inventory as cross-validation rather than the primary input.

- Replaces the previous TTL'd sled-mapping cache with a single-pass amortization built once at the top of the member reconciler pass and threaded through the per-pass reconciler context.
- DDM peer topology is the primary source. Inventory + DPD backplane is the fallback and supplements partial DDM coverage (per-sled gap-fill) rather than being all-or-nothing.
- Parsed peer port IDs are cross-validated against the DPD backplane map.
- Sequential per-switch fallback for shared-state DPD reads (backplane map, underlay group fetch), so a single unhealthy switch can't fail the whole read.
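The per-sled gap-fill described under "Per-pass sled-to-port resolution" can be pictured with the rough sketch below. It is not the code in this PR: `SledId`, `PortId`, `Source`, and `resolve_sled_ports` are hypothetical names, and it assumes that a DDM port missing from the backplane map is treated as a gap to be filled from inventory, which the description does not spell out.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the real sled and switch-port identifiers.
type SledId = u32;
type PortId = String;

/// Records which input produced a resolved mapping (illustrative only).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Source {
    Ddm,
    Inventory,
}

/// Builds the sled-to-port map once per reconciler pass.
/// DDM is consulted first; inventory fills gaps one sled at a time.
fn resolve_sled_ports(
    sleds: &[SledId],
    ddm_peers: &BTreeMap<SledId, PortId>,
    inventory: &BTreeMap<SledId, PortId>,
    backplane: &BTreeSet<PortId>,
) -> BTreeMap<SledId, (PortId, Source)> {
    let mut resolved = BTreeMap::new();
    for sled in sleds {
        // Primary: the DDM peer's port, kept only if the backplane map knows it.
        let from_ddm = ddm_peers
            .get(sled)
            .filter(|port| backplane.contains(*port))
            .map(|port| (port.clone(), Source::Ddm));
        // Fallback: inventory, checked against the same backplane map (assumption).
        let from_inventory = || {
            inventory
                .get(sled)
                .filter(|port| backplane.contains(*port))
                .map(|port| (port.clone(), Source::Inventory))
        };
        // Sleds that neither source can resolve are simply left out of the map.
        if let Some(entry) = from_ddm.or_else(from_inventory) {
            resolved.insert(*sled, entry);
        }
    }
    resolved
}
```

Because the fallback is evaluated per sled, a DDM view that covers only some sleds still contributes every mapping it has, and inventory is consulted only for the remainder.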
### Saga and RPW interaction

- Saga state guard widened: the DPD-ensure saga accepts "Active" as well as "Creating" so crash-recovery re-execution doesn't roll back already-applied DPD state.
- `instance_stop` detaches multicast members and activates the reconciler only after sled-agent acknowledges the Stop request, avoiding M2P / forwarding teardown for a still-running guest if Stop fails.

### Test updates

- Integration coverage for MRIB programming, DDM-vs-inventory drift, saga idempotent crash-recovery, per-switch invariant checks, and underlay MAC filter lifecycle.
- New `populate_ddm_peers` test helper synthesizes DDM peer topology from datastore + inventory so tests exercise the production primary path instead of the inventory fallback that an empty `DdmInstance` would otherwise force. Cache keyed on the in-service sled-set so multi-sled fixtures rebuild on sled transitions.
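As an illustration of the widened saga state guard from "Saga and RPW interaction", here is a minimal sketch. The `GroupState` enum and `dpd_ensure_may_proceed` are hypothetical; only the "Creating" and "Active" states come from the description above, and `Deleting` is an assumed extra state included to show a rejected case.

```rust
/// Hypothetical subset of multicast group states.
/// Only `Creating` and `Active` are named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroupState {
    Creating,
    Active,
    Deleting, // assumed state, used here only as a rejected example
}

/// Returns true when the DPD-ensure step may run for a group in `state`.
/// Accepting `Active` lets a saga re-executed after a crash pass the guard
/// again instead of failing and unwinding DPD state it already applied.
fn dpd_ensure_may_proceed(state: GroupState) -> bool {
    matches!(state, GroupState::Creating | GroupState::Active)
}
```

With a guard that accepted only `Creating`, a saga replayed after the group had already transitioned would fail the check and trigger its undo path.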