A Cluster API infrastructure provider for Hetzner Robot bare metal servers running Talos Linux.
CAPH (syself/cluster-api-provider-hetzner) supports Hetzner bare metal, but its provisioning flow relies on SSH + installimage + cloud-init end-to-end. Talos Linux has no SSH and no cloud-init — the two are fundamentally incompatible.
CAPH's maintainers explicitly declined Talos bare metal support (Issue #133, closed Aug 2024):
"We (Syself) won't invest time in the next months." — @guettli
This is not hostile — it's pragmatic. Syself uses kubeadm + Ubuntu; they can't maintain what they don't use. The door is open for community contributions, but nobody followed through.
CAPHR exists to fill this gap.
Cloud infrastructure providers (AWS, GCP, Azure) are OS-agnostic — they write userdata and let cloud-init handle the rest. Bare metal has no metadata service. The infrastructure provider must directly interact with the OS to deliver configuration.
This is not a CAPHR limitation — it's inherent to bare metal CAPI. Sidero Metal (Siderolabs' own bare metal provider) is equally Talos-specific.
- Cloud VM flow: API → create VM → inject userdata → cloud-init handles it. The infrastructure provider never touches OS config ✅
- Bare metal flow: rescue → write image → reboot → push config to OS API. The infrastructure provider MUST know the OS ⚠️
CAPHR injects hardware facts only — information discovered at provisioning time that CABPT/CACPPT cannot know when generating the machineconfig:
| Injected | Source | Why CABPT Can't Know |
|---|---|---|
| Install disk (`/dev/disk/by-id/...`) | Rescue `lsblk` | NVMe enumeration order differs between boots |
| Primary NIC MAC | Rescue `ip link` | Hardware-specific, unknown until SSH |
| Gateway IP | Rescue `ip route` | Hetzner-assigned, varies per server |
| Hostname (`compute-fsn1-2938104`) | Robot API server ID | Per-server identity |
| VLAN IP + static routes | `HetznerRobotHost` spec | Per-server network assignment |
| IPv6 address + routes | Robot API `server_ipv6_net` | Per-server allocation from Hetzner |
| Provider ID (`hetzner-robot://ID`) | Robot API server ID | CAPI Machine↔Node matching |
| Secretbox encryption key | Cluster-level secret | CABPT generates per-Machine keys; all CPs must share one |
| Service account key | Cluster-level secret | Same — all CPs must share for cross-node token validation |
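Each hardware fact gets its own scoped inject function. The sketch below shows what one such function could look like, under the stated assumption that the parsed machineconfig is handled as nested maps (as a YAML unmarshal would produce); the function name and structure are illustrative, not CAPHR's actual API.

```go
package main

import "fmt"

// injectHostname sets machine.network.hostname on the parsed machineconfig
// document without touching any other field. Missing intermediate maps are
// created on demand so the function is safe on a minimal base document.
func injectHostname(doc map[string]any, hostname string) {
	machine, ok := doc["machine"].(map[string]any)
	if !ok {
		machine = map[string]any{}
		doc["machine"] = machine
	}
	network, ok := machine["network"].(map[string]any)
	if !ok {
		network = map[string]any{}
		machine["network"] = network
	}
	network["hostname"] = hostname
}

func main() {
	doc := map[string]any{
		"version": "v1alpha1",
		"machine": map[string]any{"type": "controlplane"},
	}
	injectHostname(doc, "compute-fsn1-2938104")
	fmt.Println(doc["machine"].(map[string]any)["network"].(map[string]any)["hostname"])
}
```

Because each function touches exactly one subtree, it can be unit-tested in isolation against a minimal document.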
CAPHR does not inject application config (CNI settings, runtime classes, workload config). That belongs in the TalosControlPlane or TalosConfigTemplate specs managed by CABPT/CACPPT.
CABPT/CACPPT generates the base machineconfig (document 1). CAPHR appends hardware-specific config as additional YAML documents rather than modifying the base:
```yaml
# Document 1 — Generated by CABPT (untouched by CAPHR)
version: v1alpha1
kind: TalosConfig
machine:
  type: controlplane
  ...
---
# Document 2+ — Appended by CAPHR (storage volumes)
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GiB
---
apiVersion: v1alpha1
kind: RawVolumeConfig
name: osd-data
provisioning:
  diskSelector:
    match: system_disk
```

Hardware facts that modify the base document (MAC, hostname, VLAN, IPv6) are injected into document 1 via structured YAML manipulation — not string replacement. Each inject function is scoped to a single hardware fact and is independently testable.
```
┌──────────────────────────────────────────────────────────────┐
│                          CAPI Core                           │
│            (cluster lifecycle, machine lifecycle)            │
└────────────┬────────────────────────────┬────────────────────┘
             │                            │
┌────────────▼────────────┐  ┌────────────▼────────────────────┐
│     CABPT / CACPPT      │  │              CAPHR              │
│  (Talos bootstrap +     │  │  (Hetzner Robot infrastructure) │
│   control plane)        │  │                                 │
│                         │  │ pkg/robot/     Hetzner Robot API│
│ Generates machineconfig │  │ pkg/sshrescue/ Rescue SSH ops   │
│ Stores in Secret        │  │ pkg/talos/     Talos gRPC + TLS │
│ Does NOT connect to     │  │ controllers/   State machine    │
│ any machine             │  │                                 │
└─────────────────────────┘  │ Reads CABPT Secret              │
                             │ Discovers hardware facts        │
                             │ Injects facts into config       │
                             │ Delivers config via Talos gRPC  │
                             └─────────────────────────────────┘
```
CABPT/CACPPT generates the machineconfig template and stores it as a Kubernetes Secret. It never connects to any machine — it is a config generator only.
CAPHR reads the bootstrap Secret, manages the hardware lifecycle via Hetzner Robot API, discovers hardware facts during rescue, injects them into the machineconfig, and delivers the final config to Talos via gRPC. It is both the hardware manager and the config delivery agent — a necessary coupling for bare metal where no metadata service exists.
```
HetznerRobotMachine created
            │
            ▼
┌─── ActivatingRescue ────┐
│ Robot API: activate     │
│   rescue mode           │
│ Robot API: hw reset     │
└───────────┬─────────────┘
            ▼
┌─── CheckRescueActive ───┐
│ SSH probe: is rescue?   │
│ (hostname=rescue OR     │
│  /etc/hetzner-build)    │
└───────────┬─────────────┘
            ▼
┌─── InRescue ────────────┐
│ SSH: detect primary MAC │
│ SSH: detect gateway IP  │
│ SSH: resolve install    │
│   disk (by-id path)     │
│ SSH: curl | dd image    │
│ SSH: fix EFI boot order │
│ Robot API: deactivate   │
│   rescue                │
│ Robot API: hw reset     │
└───────────┬─────────────┘
            ▼
┌─── BootingTalos ────────┐
│ Poll: TCP port 50000    │
│ Verify: maintenance     │
│   mode (gRPC probe)     │
└───────────┬─────────────┘
            ▼
┌─── ApplyingConfig ──────┐
│ Read CABPT Secret       │
│ Inject: install disk    │
│ Inject: VLAN + routes   │
│ Inject: hostname        │
│ Inject: IPv6            │
│ Inject: cluster secrets │
│ Inject: provider ID     │
│ Append: VolumeConfig    │
│ gRPC: ApplyConfig       │
└───────────┬─────────────┘
            ▼
┌─── WaitingForBoot ──────┐
│ Wait for Talos reboot   │
│ Poll: K8s API (6443)    │
└───────────┬─────────────┘
            ▼
┌─── Provisioned ─────────┐
│ status.ready = true     │
└─────────────────────────┘
```
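The happy path above, plus the stale-Talos re-entry into rescue, can be captured as a small transition table. A sketch with the state names from the diagram; the map structure and function are illustrative, not CAPHR's actual controller types:

```go
package main

import "fmt"

// validTransitions lists, for each provisioning state, the states a
// reconcile loop may move to next. BootingTalos can fall back to
// ActivatingRescue when a stale full-mode Talos is detected.
var validTransitions = map[string][]string{
	"ActivatingRescue":  {"CheckRescueActive"},
	"CheckRescueActive": {"InRescue"},
	"InRescue":          {"BootingTalos"},
	"BootingTalos":      {"ApplyingConfig", "ActivatingRescue"},
	"ApplyingConfig":    {"WaitingForBoot"},
	"WaitingForBoot":    {"Provisioned"},
}

func canTransition(from, to string) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("InRescue", "BootingTalos"))
	fmt.Println(canTransition("Provisioned", "InRescue"))
}
```

Encoding transitions as data keeps illegal state jumps impossible by construction and makes the machine easy to test.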
- Transient errors (SSH timeout, API 503, network): exponential backoff (15s × 2^n, capped at 5 min), tracked via `retryCount`
- Permanent errors (missing secret, invalid config): immediate `StateError`, no retry
- Max 8 rescue boot retries: if rescue never activates, enter `StateError`
- Stale Talos detection: if Talos boots in full mode (not maintenance), re-enter rescue
Cluster-level configuration.
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotCluster
spec:
  controlPlaneEndpoint:
    host: "10.10.0.240"
    port: 6443
  robotSecretRef:
    name: hetzner-robot-credentials  # Keys: robot-user, robot-password
  sshSecretRef:
    name: hetzner-ssh-key            # Keys: ssh-privatekey, ssh-fingerprint
  dc: fsn1                           # Datacenter for hostname generation
  vlanConfig:                        # Optional internal network
    id: 4000
    interface: enp193s0f0np0
    prefixLength: 24
  talosSecretRef:                    # Optional shared cluster secrets
    name: talos-secrets              # Key: bundle (YAML)
  talosFactoryBaseURL: "https://factory.talos.dev"
```

Permanent physical server inventory. One per Hetzner dedicated server. Never deleted when machines are removed.
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotHost
metadata:
  name: node6
  labels:
    role: compute
spec:
  serverID: 2938104
  serverIP: ""          # Auto-detected from Robot API if empty
  serverIPv6Net: ""     # Auto-detected from Robot API if empty
  internalIP: 10.10.0.6 # VLAN IP (static assignment)
status:
  state: Available      # Available → Claimed → Provisioned → Deprovisioning
```

States: Available (in pool) → Claimed (assigned to a Machine) → Provisioned (running) → Deprovisioning (machine deleted, awaiting cleanup) → Available.
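The hostname seen throughout this document (`compute-fsn1-2938104`) appears to combine the host's role label, the cluster's `dc` field, and the Robot server ID. A sketch of that derivation; the function name and exact scheme are assumptions, not confirmed CAPHR internals:

```go
package main

import "fmt"

// hostnameFor derives a per-server hostname from the host's role label,
// the cluster datacenter, and the Hetzner Robot server ID.
func hostnameFor(role, dc string, serverID int) string {
	return fmt.Sprintf("%s-%s-%d", role, dc, serverID)
}

func main() {
	// Values from the HetznerRobotHost and HetznerRobotCluster examples above.
	fmt.Println(hostnameFor("compute", "fsn1", 2938104))
}
```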
Per-machine provisioning config. Created by CAPI from TalosControlPlane or MachineDeployment.
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotMachine
spec:
  hostRef:
    name: node6                 # Direct assignment (OR hostSelector for pool)
  talosSchematic: "3da7f440..." # Talos factory schematic ID
  talosVersion: "v1.12.4"
  installDisk: "/dev/nvme0n1"   # Default, overridden by rescue detection
  ephemeralSize: "100GiB"       # Optional: limit EPHEMERAL, create OSD partition
```

Template for MachineDeployment (worker scaling).
MachineHealthCheck integration. When a node is unhealthy, CAPHR issues a hardware reset via Robot API and waits for recovery.
```yaml
spec:
  strategy:
    type: Reboot
  retryLimit: 3  # Max hardware resets before giving up
  timeout: 300s  # Wait time after each reset
```

```
caphr/
├── api/v1alpha1/                           CRD type definitions
│   ├── hetznerrobotcluster_types.go
│   ├── hetznerrobotmachine_types.go
│   ├── hetznerrobothost_types.go
│   └── hetznerrobotremediation_types.go
├── controllers/
│   ├── hetznerrobotmachine_controller.go   State machine + hardware fact injection
│   ├── hetznerrobothost_controller.go      Server inventory management
│   ├── hetznerrobotcluster_controller.go   Cluster lifecycle
│   └── hetznerrobotremediation_controller.go  MHC hardware reset
├── pkg/
│   ├── robot/                              Hetzner Robot API client
│   │   ├── client.go                       GetServer, ActivateRescue, ResetServer, etc.
│   │   └── from_secret.go                  Credential loading from K8s Secret
│   ├── sshrescue/                          Rescue mode SSH operations
│   │   └── client.go                       MAC detection, disk detection, image install
│   └── talos/                              Talos API integration
│       ├── client.go                       gRPC connectivity, maintenance mode detection
│       └── bundle.go                       Secret bundle parsing, admin TLS generation
├── examples/                               Production cluster manifests
├── config/crd/                             Generated CRD YAML
└── infra/                                  Deployment manifests
```
Hetzner assigns public IPs via DHCP with /25 or /26 netmasks. When two servers land in the same subnet, the kernel treats them as on-link and ARPs for them directly instead of routing through the gateway. Hetzner blocks direct L2 traffic between servers, breaking connectivity.
Fix: Static /32 address + explicit on-link route for the gateway + default route via gateway. Forces all traffic through the router.
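The resulting network shape can be expressed as three facts: a /32 host address, an on-link host route to the gateway, and a default route via it. A sketch with illustrative struct names (not Talos machineconfig types) and example IPs:

```go
package main

import "fmt"

// Route is an illustrative route entry: an empty Gateway means the
// destination is reachable on-link (no next hop).
type Route struct {
	Network string
	Gateway string
}

// publicRoutes builds the /32 address and the two routes that force all
// public traffic through the Hetzner gateway, even for same-subnet peers.
func publicRoutes(serverIP, gatewayIP string) (string, []Route) {
	addr := serverIP + "/32"
	routes := []Route{
		{Network: gatewayIP + "/32"},               // gateway reachable on-link
		{Network: "0.0.0.0/0", Gateway: gatewayIP}, // everything else via gateway
	}
	return addr, routes
}

func main() {
	// Example addresses (documentation range), not real Hetzner assignments.
	addr, routes := publicRoutes("198.51.100.7", "198.51.100.1")
	fmt.Println(addr, routes)
}
```

With a /32 address, no peer is ever on-link except the explicitly routed gateway, so the kernel can never ARP for a neighboring server.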
NVMe device names (/dev/nvme0n1, nvme1n1) depend on PCI probe order, which differs between Hetzner rescue Linux and Talos. A disk that's nvme0n1 in rescue might be nvme1n1 in Talos.
Fix: Resolve to stable /dev/disk/by-id/ path during rescue. Use this path in the machineconfig.
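The resolution step amounts to inverting the `/dev/disk/by-id` symlink table for the device chosen in rescue. A sketch under stated assumptions: the symlink table is passed in as a map (the real code reads it over SSH), and the preference for `nvme-*` model/serial links over `nvme-eui.*` links is an illustrative choice, not confirmed CAPHR behavior:

```go
package main

import (
	"fmt"
	"strings"
)

// pickByIDPath returns a stable /dev/disk/by-id/ path for an ephemeral
// device name, preferring human-readable model/serial links.
func pickByIDPath(byID map[string]string, device string) (string, bool) {
	var fallback string
	for link, target := range byID {
		if target != device {
			continue
		}
		if strings.HasPrefix(link, "nvme-") && !strings.HasPrefix(link, "nvme-eui.") {
			return "/dev/disk/by-id/" + link, true
		}
		fallback = "/dev/disk/by-id/" + link
	}
	return fallback, fallback != ""
}

func main() {
	// Hypothetical symlink table; the serial numbers are made up.
	byID := map[string]string{
		"nvme-SAMSUNG_MZVL21T0_S000000": "/dev/nvme0n1",
		"nvme-eui.0025388000000000":     "/dev/nvme0n1",
		"wwn-0x5002538f00000000":        "/dev/sda",
	}
	path, ok := pickByIDPath(byID, "/dev/nvme0n1")
	fmt.Println(path, ok)
}
```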
After Talos installation, PXE might still be first in the EFI boot order, causing the server to network-boot instead of booting Talos.
Fix: Use efibootmgr in rescue to set Talos first, PXE last, delete stale entries.
Storage servers may have existing Ceph BlueStore signatures on disks. Installing Talos on a Ceph OSD disk would destroy data.
Fix: lsblk + blkid check for ceph_bluestore signature. Refuse to install on disks with active Ceph data. ephemeralSize spec creates a separate OSD-ready partition via Talos VolumeConfig.
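The guard itself is a string check over the probe output. A minimal sketch, assuming `blkid`-style `KEY="value"` output (the real check runs `lsblk`/`blkid` over SSH in rescue):

```go
package main

import (
	"fmt"
	"strings"
)

// hasCephSignature reports whether blkid-style output contains a
// ceph_bluestore filesystem signature; installation must refuse
// any disk for which this returns true.
func hasCephSignature(blkidOutput string) bool {
	return strings.Contains(blkidOutput, `TYPE="ceph_bluestore"`)
}

func main() {
	out := `/dev/nvme1n1: TYPE="ceph_bluestore"`
	if hasCephSignature(out) {
		fmt.Println("refusing to install: active Ceph data")
	}
}
```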
| | CAPHR | CAPH (syself) | Sidero Metal |
|---|---|---|---|
| Target OS | Talos | Ubuntu/Debian (kubeadm) | Talos |
| Bare metal | Hetzner Robot | Hetzner Robot + Cloud | Any (IPMI/PXE) |
| Provisioning | Rescue SSH → dd → Talos gRPC | Rescue SSH → installimage → cloud-init | PXE → Agent → Talos API |
| Hardware discovery | SSH in rescue mode | SSH in rescue mode | Agent on booted machine |
| Config delivery | Direct gRPC push | SSH + cloud-init | Agent pull |
| OS-agnostic | No (Talos-specific) | Yes (SSH + cloud-init) | No (Talos-specific) |
| Status | Production (Sylphx) | GA v1.0+ | Deprecated (→ Omni) |
Production — running the Sylphx Platform infrastructure: 3 control plane + 3 worker nodes on Hetzner AX-series dedicated servers.
Apache 2.0