SylphxAI/cluster-api-provider-hetzner-robot

CAPHR — Cluster API Provider Hetzner Robot

A Cluster API infrastructure provider for Hetzner Robot bare metal servers running Talos Linux.

Why Not CAPH?

CAPH (syself/cluster-api-provider-hetzner) supports Hetzner bare metal, but its provisioning flow relies on SSH + installimage + cloud-init end-to-end. Talos Linux has no SSH and no cloud-init — the two are fundamentally incompatible.

CAPH's maintainers explicitly declined Talos bare metal support (Issue #133, closed Aug 2024):

"We (Syself) won't invest time in the next months." — @guettli

This is not hostile — it's pragmatic. Syself uses kubeadm + Ubuntu; they can't maintain what they don't use. The door is open for community contributions, but nobody followed through.

CAPHR exists to fill this gap.

Design Philosophy

Talos-Specific by Design

Cloud infrastructure providers (AWS, GCP, Azure) are OS-agnostic — they write userdata and let cloud-init handle the rest. Bare metal has no metadata service. The infrastructure provider must directly interact with the OS to deliver configuration.

This is not a CAPHR limitation — it's inherent to bare metal CAPI. Sidero Metal (Siderolabs' own bare metal provider) is equally Talos-specific.

```
Cloud VM flow:     API → create VM → inject userdata → cloud-init handles it
                   Infrastructure provider never touches OS config ✅

Bare metal flow:   Rescue → write image → reboot → push config to OS API
                   Infrastructure provider MUST know the OS ⚠️
```

Separation of Concerns

CAPHR injects hardware facts only — information discovered at provisioning time that CABPT/CACPPT cannot know when generating the machineconfig:

| Injected | Source | Why CABPT can't know |
|---|---|---|
| Install disk (`/dev/disk/by-id/...`) | Rescue `lsblk` | NVMe enumeration order differs between boots |
| Primary NIC MAC | Rescue `ip link` | Hardware-specific, unknown until SSH |
| Gateway IP | Rescue `ip route` | Hetzner-assigned, varies per server |
| Hostname (`compute-fsn1-2938104`) | Robot API server ID | Per-server identity |
| VLAN IP + static routes | `HetznerRobotHost` spec | Per-server network assignment |
| IPv6 address + routes | Robot API `server_ipv6_net` | Per-server allocation from Hetzner |
| Provider ID (`hetzner-robot://ID`) | Robot API server ID | CAPI Machine↔Node matching |
| Secretbox encryption key | Cluster-level secret | CABPT generates per-Machine keys; all control planes must share one |
| Service account key | Cluster-level secret | Same; all control planes must share it for cross-node token validation |

CAPHR does not inject application config (CNI settings, runtime classes, workload config). That belongs in the TalosControlPlane or TalosConfigTemplate specs managed by CABPT/CACPPT.

Multi-Document Config

CABPT/CACPPT generates the base machineconfig (document 1). CAPHR appends hardware-specific config as additional YAML documents rather than modifying the base:

```yaml
# Document 1 — Generated by CABPT (untouched by CAPHR)
version: v1alpha1
machine:
  type: controlplane
  ...
---
# Document 2+ — Appended by CAPHR (storage volumes)
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GiB
---
apiVersion: v1alpha1
kind: RawVolumeConfig
name: osd-data
provisioning:
  diskSelector:
    match: system_disk
```

Hardware facts that modify the base document (MAC, hostname, VLAN, IPv6) are injected into document 1 via structured YAML manipulation — not string replacement. Each inject function is scoped to a single hardware fact and is independently testable.
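The append step itself is just a multi-document YAML join: the base document passes through byte-for-byte and new documents are added after a `---` separator. A minimal sketch (illustrative only; `appendDocuments` is a hypothetical helper, not CAPHR's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// appendDocuments joins the CABPT-generated base machineconfig with
// CAPHR's extra documents using the YAML document separator. The base
// document is passed through unchanged; only new documents are appended.
func appendDocuments(base string, extras ...string) string {
	docs := append([]string{strings.TrimRight(base, "\n")}, extras...)
	return strings.Join(docs, "\n---\n") + "\n"
}

func main() {
	base := "version: v1alpha1\nmachine:\n  type: controlplane"
	volume := "apiVersion: v1alpha1\nkind: VolumeConfig\nname: EPHEMERAL"
	fmt.Print(appendDocuments(base, volume))
}
```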

Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        CAPI Core                             │
│              (cluster lifecycle, machine lifecycle)           │
└────────────┬────────────────────────────┬────────────────────┘
             │                            │
┌────────────▼────────────┐  ┌────────────▼────────────────────┐
│  CABPT / CACPPT         │  │  CAPHR                          │
│  (Talos bootstrap +     │  │  (Hetzner Robot infrastructure) │
│   control plane)        │  │                                  │
│                         │  │  pkg/robot/    Hetzner Robot API │
│  Generates machineconfig│  │  pkg/sshrescue/  Rescue SSH ops  │
│  Stores in Secret       │  │  pkg/talos/    Talos gRPC + TLS │
│  Does NOT connect to    │  │  controllers/  State machine     │
│  any machine            │  │                                  │
└─────────────────────────┘  │  Reads CABPT Secret              │
                             │  Discovers hardware facts         │
                             │  Injects facts into config        │
                             │  Delivers config via Talos gRPC   │
                             └──────────────────────────────────┘
```

CABPT/CACPPT generates the machineconfig template and stores it as a Kubernetes Secret. It never connects to any machine — it is a config generator only.

CAPHR reads the bootstrap Secret, manages the hardware lifecycle via Hetzner Robot API, discovers hardware facts during rescue, injects them into the machineconfig, and delivers the final config to Talos via gRPC. It is both the hardware manager and the config delivery agent — a necessary coupling for bare metal where no metadata service exists.

Provisioning Flow

```
  HetznerRobotMachine created
          │
          ▼
  ┌─── ActivatingRescue ───┐
  │  Robot API: activate    │
  │  rescue mode            │
  │  Robot API: hw reset    │
  └──────────┬──────────────┘
             ▼
  ┌─── CheckRescueActive ──┐
  │  SSH probe: is rescue?  │
  │  (hostname=rescue OR    │
  │   /etc/hetzner-build)   │
  └──────────┬──────────────┘
             ▼
  ┌─── InRescue ────────────┐
  │  SSH: detect primary MAC│
  │  SSH: detect gateway IP │
  │  SSH: resolve install   │
  │    disk (by-id path)    │
  │  SSH: curl | dd image   │
  │  SSH: fix EFI boot order│
  │  Robot API: deactivate  │
  │    rescue               │
  │  Robot API: hw reset    │
  └──────────┬──────────────┘
             ▼
  ┌─── BootingTalos ────────┐
  │  Poll: TCP port 50000   │
  │  Verify: maintenance    │
  │    mode (gRPC probe)    │
  └──────────┬──────────────┘
             ▼
  ┌─── ApplyingConfig ──────┐
  │  Read CABPT Secret      │
  │  Inject: install disk   │
  │  Inject: VLAN + routes  │
  │  Inject: hostname       │
  │  Inject: IPv6           │
  │  Inject: cluster secrets│
  │  Inject: provider ID    │
  │  Append: VolumeConfig   │
  │  gRPC: ApplyConfig      │
  └──────────┬──────────────┘
             ▼
  ┌─── WaitingForBoot ──────┐
  │  Wait for Talos reboot  │
  │  Poll: K8s API (6443)   │
  └──────────┬──────────────┘
             ▼
  ┌─── Provisioned ─────────┐
  │  status.ready = true    │
  └─────────────────────────┘
```

Error Handling

  • Transient errors (SSH timeout, API 503, network): exponential backoff (15s × 2^n, capped at 5min), tracked via retryCount
  • Permanent errors (missing secret, invalid config): immediate StateError, no retry
  • Max 8 rescue boot retries: if rescue never activates, enter StateError
  • Stale Talos detection: if Talos boots in full mode (not maintenance), re-enter rescue

CRDs

HetznerRobotCluster

Cluster-level configuration.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotCluster
spec:
  controlPlaneEndpoint:
    host: "10.10.0.240"
    port: 6443
  robotSecretRef:
    name: hetzner-robot-credentials    # Keys: robot-user, robot-password
  sshSecretRef:
    name: hetzner-ssh-key              # Keys: ssh-privatekey, ssh-fingerprint
  dc: fsn1                             # Datacenter for hostname generation
  vlanConfig:                          # Optional internal network
    id: 4000
    interface: enp193s0f0np0
    prefixLength: 24
  talosSecretRef:                      # Optional shared cluster secrets
    name: talos-secrets                # Key: bundle (YAML)
  talosFactoryBaseURL: "https://factory.talos.dev"
```
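The `dc` field feeds hostname generation, and the Robot server ID feeds both the hostname and the provider ID (formats shown in the hardware-facts table above). A sketch of the naming scheme (hypothetical helpers, not CAPHR's actual functions; the `compute` role prefix is assumed from the host labels):

```go
package main

import "fmt"

// hostname composes role, datacenter, and Robot server ID into the
// per-server identity, e.g. compute-fsn1-2938104.
func hostname(role, dc string, serverID int) string {
	return fmt.Sprintf("%s-%s-%d", role, dc, serverID)
}

// providerID uses the hetzner-robot:// scheme that CAPI matches
// against Node provider IDs.
func providerID(serverID int) string {
	return fmt.Sprintf("hetzner-robot://%d", serverID)
}

func main() {
	fmt.Println(hostname("compute", "fsn1", 2938104)) // compute-fsn1-2938104
	fmt.Println(providerID(2938104))                  // hetzner-robot://2938104
}
```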

HetznerRobotHost

Permanent physical server inventory. One per Hetzner dedicated server. Never deleted when machines are removed.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotHost
metadata:
  name: node6
  labels:
    role: compute
spec:
  serverID: 2938104
  serverIP: ""           # Auto-detected from Robot API if empty
  serverIPv6Net: ""      # Auto-detected from Robot API if empty
  internalIP: 10.10.0.6  # VLAN IP (static assignment)
status:
  state: Available       # Available → Claimed → Provisioned → Deprovisioning
```

States: Available (in pool) → Claimed (assigned to a Machine) → Provisioned (running) → Deprovisioning (machine deleted, awaiting cleanup) → Available

HetznerRobotMachine

Per-machine provisioning config. Created by CAPI from TalosControlPlane or MachineDeployment.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotMachine
spec:
  hostRef:
    name: node6                  # Direct assignment (OR hostSelector for pool)
  talosSchematic: "3da7f440..."  # Talos factory schematic ID
  talosVersion: "v1.12.4"
  installDisk: "/dev/nvme0n1"   # Default, overridden by rescue detection
  ephemeralSize: "100GiB"       # Optional: limit EPHEMERAL, create OSD partition
```

HetznerRobotMachineTemplate

Template for MachineDeployment (worker scaling).

HetznerRobotRemediation / HetznerRobotRemediationTemplate

MachineHealthCheck integration. When a node is unhealthy, CAPHR issues a hardware reset via Robot API and waits for recovery.

```yaml
spec:
  strategy:
    type: Reboot
    retryLimit: 3       # Max hardware resets before giving up
    timeout: 300s       # Wait time after each reset
```

Package Structure

```
caphr/
├── api/v1alpha1/           CRD type definitions
│   ├── hetznerrobotcluster_types.go
│   ├── hetznerrobotmachine_types.go
│   ├── hetznerrobothost_types.go
│   └── hetznerrobotremediation_types.go
├── controllers/
│   ├── hetznerrobotmachine_controller.go   State machine + hardware fact injection
│   ├── hetznerrobothost_controller.go      Server inventory management
│   ├── hetznerrobotcluster_controller.go   Cluster lifecycle
│   └── hetznerrobotremediation_controller.go  MHC hardware reset
├── pkg/
│   ├── robot/              Hetzner Robot API client
│   │   ├── client.go       GetServer, ActivateRescue, ResetServer, etc.
│   │   └── from_secret.go  Credential loading from K8s Secret
│   ├── sshrescue/          Rescue mode SSH operations
│   │   └── client.go       MAC detection, disk detection, image install
│   └── talos/              Talos API integration
│       ├── client.go       gRPC connectivity, maintenance mode detection
│       └── bundle.go       Secret bundle parsing, admin TLS generation
├── examples/               Production cluster manifests
├── config/crd/             Generated CRD YAML
└── infra/                  Deployment manifests
```

Hetzner-Specific Workarounds

L2 Isolation (Static /32 Routing)

Hetzner DHCP assigns addresses from /25 or /26 prefixes. When two servers land in the same subnet, the kernel tries to reach the peer directly via ARP instead of routing through the gateway. Hetzner blocks direct L2 traffic between servers, so connectivity silently breaks.

Fix: Static /32 address + explicit on-link route for the gateway + default route via gateway. Forces all traffic through the router.
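The resulting Talos network config looks roughly like this (a sketch with placeholder addresses, not a verbatim CAPHR output):

```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 203.0.113.10/32         # public IP as /32: no on-link subnet, no ARP to peers
        routes:
          - network: 203.0.113.1/32 # gateway reachable on-link
          - network: 0.0.0.0/0      # default route via gateway
            gateway: 203.0.113.1
```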

NVMe Device Name Instability

NVMe device names (/dev/nvme0n1, nvme1n1) depend on PCI probe order, which differs between Hetzner rescue Linux and Talos. A disk that's nvme0n1 in rescue might be nvme1n1 in Talos.

Fix: Resolve to stable /dev/disk/by-id/ path during rescue. Use this path in the machineconfig.

EFI Boot Order

After Talos installation, PXE might still be first in the EFI boot order, causing the server to network-boot instead of booting Talos.

Fix: Use efibootmgr in rescue to set Talos first, PXE last, delete stale entries.

Ceph OSD Disk Detection

Storage servers may have existing Ceph BlueStore signatures on disks. Installing Talos on a Ceph OSD disk would destroy data.

Fix: lsblk + blkid check for ceph_bluestore signature. Refuse to install on disks with active Ceph data. ephemeralSize spec creates a separate OSD-ready partition via Talos VolumeConfig.
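The safety check reduces to scanning blkid-style output for the BlueStore signature before touching a disk. A minimal sketch (`hasCephSignature` is a hypothetical helper operating on captured command output, not CAPHR's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// hasCephSignature reports whether blkid-style output for a disk
// contains a Ceph BlueStore signature, in which case installation
// must be refused to avoid destroying OSD data.
func hasCephSignature(blkidOutput string) bool {
	return strings.Contains(blkidOutput, `TYPE="ceph_bluestore"`)
}

func main() {
	out := `/dev/nvme1n1: TYPE="ceph_bluestore"`
	if hasCephSignature(out) {
		fmt.Println("refusing to install: active Ceph data on disk")
	}
}
```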

Comparison with Alternatives

| | CAPHR | CAPH (syself) | Sidero Metal |
|---|---|---|---|
| Target OS | Talos | Ubuntu/Debian (kubeadm) | Talos |
| Bare metal | Hetzner Robot | Hetzner Robot + Cloud | Any (IPMI/PXE) |
| Provisioning | Rescue SSH → dd → Talos gRPC | Rescue SSH → installimage → cloud-init | PXE → Agent → Talos API |
| Hardware discovery | SSH in rescue mode | SSH in rescue mode | Agent on booted machine |
| Config delivery | Direct gRPC push | SSH + cloud-init | Agent pull |
| OS-agnostic | No (Talos-specific) | Yes (SSH + cloud-init) | No (Talos-specific) |
| Status | Production (Sylphx) | GA v1.0+ | Deprecated (→ Omni) |

Status

Production — running the Sylphx Platform infrastructure: 3 control plane + 3 worker nodes on Hetzner AX-series dedicated servers.

License

Apache 2.0
