Summary
Monarch actor pools (generator, trainer, evaluator, replay_buffer) currently rely on hard-coded coordinator addresses or Monarch-internal RPC for peer discovery. This proposal adds opt-in DNS-AID service discovery so that pools can self-register SVCB records on startup and discover peers via DNS, enabling cross-cluster and external service discovery.
Motivation
In multi-cluster or hybrid deployments, services need a way to find each other without static configuration. DNS-based service discovery (via DNS-AID SVCB records) is lightweight, infrastructure-agnostic, and doesn't require a separate service mesh or registry.
Proposed Changes
- New DnsAidConfig dataclass on ServiceConfig with enabled, name, domain, port, ttl, capabilities, category fields
- New src/forge/controller/dns_aid.py helper module wrapping dns_aid.publish/unpublish/discover with:
  - Dual enable guard (env var DNS_AID_ENABLED + config flag)
  - Best-effort error handling (never blocks service startup/shutdown)
  - Exponential backoff retry for peer discovery
  - Lazy import (dns-aid package is optional)
- ForgeActor.as_service() calls publish_service() after initialization
- Provisioner.shutdown_all_allocations() calls unpublish_service() before teardown
- dns-aid added as optional dependency (pip install forge[dns-aid])
- Unit tests with mocked dns_aid calls
- Documentation
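To make the config and guard items above concrete, here is a rough sketch of how the dataclass, the dual enable guard, and the backoff retry could look. The field names come from the list above; the field types, defaults, accepted env var values, and the helper names dns_aid_enabled and discover_with_backoff are assumptions for illustration, not the final API. The discovery callable is passed in, so the sketch makes no claim about the dns_aid.discover signature.

```python
import os
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DnsAidConfig:
    # Field names are from the proposal; types and defaults are assumed.
    enabled: bool = False
    name: str = ""
    domain: str = ""
    port: int = 0
    ttl: int = 30  # seconds, matching the proposed default TTL
    capabilities: List[str] = field(default_factory=list)
    category: str = ""


def dns_aid_enabled(config: Optional[DnsAidConfig]) -> bool:
    """Dual guard: the env var AND the config flag must both be on."""
    raw = os.environ.get("DNS_AID_ENABLED", "false")
    env_on = raw.strip().lower() in ("1", "true", "yes")
    return bool(env_on and config is not None and config.enabled)


def discover_with_backoff(
    discover_fn: Callable[[], list],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Retry discover_fn, doubling the delay after each empty or failed try."""
    for attempt in range(attempts):
        try:
            peers = discover_fn()
            if peers:
                return peers
        except Exception:
            pass  # best-effort: a failed lookup counts as "no peers yet"
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return []  # caller decides what to do when no peer was found
```

Passing sleep as a parameter keeps the retry loop testable without real delays, which fits the plan to unit-test with mocked dns_aid calls.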
Design Constraints
- Fully opt-in: DNS_AID_ENABLED=false by default, no impact on existing deployments
- No new required dependencies: dns-aid is optional
- Best-effort: DNS failures never block service lifecycle
- TTL-based fallback: Records for dead workers expire automatically (default TTL 30s)
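The "no new required dependencies" and "best-effort" constraints could be met along these lines. This is only a sketch: the keyword arguments passed to dns_aid.publish are an assumption (the real signature may differ, and may be async), and the loader parameter and best_effort helper are hypothetical names used here to keep the example testable.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def _load_dns_aid():
    """Import dns_aid lazily; return None when the optional package is absent."""
    try:
        return importlib.import_module("dns_aid")
    except ImportError:
        return None


def best_effort(action, *args, **kwargs) -> bool:
    """Run a DNS action and log any failure instead of raising."""
    try:
        action(*args, **kwargs)
        return True
    except Exception as exc:
        logger.warning("DNS-AID call failed, continuing without it: %s", exc)
        return False


def publish_service(config, loader=_load_dns_aid) -> bool:
    """Publish a record if dns_aid is installed; never raise to the caller."""
    module = loader()
    if module is None:
        logger.debug("dns-aid not installed, skipping publish")
        return False
    # Assumption: publish() accepts these keyword arguments. Adjust to the
    # actual dns_aid API before use.
    return best_effort(
        module.publish,
        name=config.name,
        domain=config.domain,
        port=config.port,
        ttl=config.ttl,
    )
```

Because publish_service returns a boolean rather than raising, ForgeActor.as_service() can call it after initialization without wrapping it in its own error handling.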
Happy to hear if this is something the team would consider, or if you'd prefer this as a separate community package.