feat: DNS-based service discovery for Monarch actor pools #771

@IngmarVG-IB

Summary

Monarch actor pools (generator, trainer, evaluator, replay_buffer) currently rely on hard-coded coordinator addresses or Monarch-internal RPC for peer discovery. This proposal adds opt-in DNS-AID service discovery so that pools can self-register SVCB records on startup and discover peers via DNS, enabling cross-cluster and external service discovery.

Motivation

In multi-cluster or hybrid deployments, services need a way to find each other without static configuration. DNS-based service discovery (via DNS-AID SVCB records) is lightweight, infrastructure-agnostic, and doesn't require a separate service mesh or registry.

Proposed Changes

  • New DnsAidConfig dataclass on ServiceConfig with enabled, name, domain, port, ttl, capabilities, category fields
  • New src/forge/controller/dns_aid.py helper module wrapping dns_aid.publish/unpublish/discover with:
    • Dual enable guard (env var DNS_AID_ENABLED + config flag)
    • Best-effort error handling (never blocks service startup/shutdown)
    • Exponential backoff retry for peer discovery
    • Lazy import (dns-aid package is optional)
  • ForgeActor.as_service() calls publish_service() after initialization
  • Provisioner.shutdown_all_allocations() calls unpublish_service() before teardown
  • dns-aid added as optional dependency (pip install forge[dns-aid])
  • Unit tests with mocked dns_aid calls
  • Documentation
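
A minimal sketch of how the config, the dual enable guard, the lazy import, and the best-effort handling could fit together. Only the field names, DNS_AID_ENABLED, and the 30s TTL default come from this proposal. The field types, the other defaults, the accepted truthy strings, the keyword arguments passed to dns_aid.publish, and the publish_fn test hook are all assumptions; the real dns-aid API may differ (it may be async, for example).

```python
import logging
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class DnsAidConfig:
    """Field names follow the proposal; types and defaults are assumed."""

    enabled: bool = False
    name: str = ""
    domain: str = ""
    port: int = 0
    ttl: int = 30  # seconds; the proposal states a 30s default
    capabilities: list[str] = field(default_factory=list)
    category: str = ""


def is_enabled(cfg: DnsAidConfig) -> bool:
    """Dual guard: the env var AND the config flag must both be on."""
    raw = os.environ.get("DNS_AID_ENABLED", "false")
    env_on = raw.strip().lower() in {"1", "true", "yes"}  # assumed truthy values
    return env_on and cfg.enabled


def publish_service(cfg: DnsAidConfig, publish_fn=None) -> bool:
    """Best-effort publish: returns False instead of raising.

    publish_fn is a hypothetical hook so tests can avoid real DNS calls.
    """
    if not is_enabled(cfg):
        return False
    try:
        if publish_fn is None:
            import dns_aid  # lazy import: the package is optional

            publish_fn = dns_aid.publish  # assumed call; signature not verified
        publish_fn(name=cfg.name, domain=cfg.domain, port=cfg.port, ttl=cfg.ttl)
        return True
    except Exception as exc:  # ImportError lands here too
        logger.warning("DNS-AID publish skipped: %s", exc)
        return False
```

Returning a bool rather than raising lets a caller such as ForgeActor.as_service() log the outcome and continue startup either way.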

Design Constraints

  • Fully opt-in: DNS_AID_ENABLED defaults to false, so existing deployments are unaffected
  • No new required dependencies: dns-aid is optional
  • Best-effort: DNS failures never block service lifecycle
  • TTL-based fallback: records for dead workers expire automatically once their TTL elapses (default 30s)
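
The exponential backoff retry for peer discovery can be sketched in a way that also respects the best-effort constraint. The function name, its parameters, and the delay values below are illustrative assumptions, not part of the proposal. discover_fn stands in for whatever wraps dns_aid.discover.

```python
import time


def discover_with_backoff(
    discover_fn,
    attempts=4,
    base_delay=0.5,
    max_delay=8.0,
    sleep=time.sleep,
):
    """Call discover_fn() until it returns peers, doubling the wait after each miss.

    Returns an empty list if every attempt fails, so a DNS outage never
    surfaces as an exception in the caller.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            peers = discover_fn()
            if peers:
                return list(peers)
        except Exception:
            pass  # treat lookup errors the same as an empty result
        if attempt < attempts - 1:  # no sleep after the final attempt
            sleep(min(delay, max_delay))
            delay *= 2
    return []
```

Passing sleep as a parameter keeps unit tests fast: a test can record the requested delays instead of actually waiting.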

Happy to hear if this is something the team would consider, or if you'd prefer this as a separate community package.
