SylphxAI/cluster-api-provider-hetzner-robot

CAPHR — Cluster API Provider Hetzner Robot

A Cluster API infrastructure provider for Hetzner Robot bare metal servers running Talos Linux.

Why Not CAPH?

CAPH (syself/cluster-api-provider-hetzner) supports Hetzner bare metal, but its provisioning flow relies on SSH + installimage + cloud-init end-to-end. Talos Linux has no SSH and no cloud-init — the two are fundamentally incompatible.

CAPH's maintainers explicitly declined Talos bare metal support (Issue #133, closed Aug 2024):

"We (Syself) won't invest time in the next months." — @guettli

This is not hostile — it's pragmatic. Syself uses kubeadm + Ubuntu; they can't maintain what they don't use. The door is open for community contributions, but nobody followed through.

CAPHR exists to fill this gap.

Design Philosophy

Talos-Specific by Design

Cloud infrastructure providers (AWS, GCP, Azure) are OS-agnostic — they write userdata and let cloud-init handle the rest. Bare metal has no metadata service. The infrastructure provider must directly interact with the OS to deliver configuration.

This is not a CAPHR limitation — it's inherent to bare metal CAPI. Sidero Metal (Siderolabs' own bare metal provider) is equally Talos-specific.

```
Cloud VM flow:     API → create VM → inject userdata → cloud-init handles it
                   Infrastructure provider never touches OS config ✅

Bare metal flow:   Rescue → write image → reboot → push config to OS API
                   Infrastructure provider MUST know the OS ⚠️
```

Separation of Concerns

CAPHR injects hardware facts only — information discovered at provisioning time that CABPT/CACPPT cannot know when generating the machineconfig:

| Injected | Source | Why CABPT can't know |
|---|---|---|
| Install disk (`/dev/disk/by-id/...`) | Rescue `lsblk` | NVMe enumeration order differs between boots |
| Primary NIC MAC | Rescue `ip link` | Hardware-specific, unknown until SSH |
| Gateway IP | Rescue `ip route` | Hetzner-assigned, varies per server |
| Hostname (`compute-fsn1-2938104`) | Robot API server ID | Per-server identity |
| VLAN IP + static routes | `HetznerRobotHost` spec | Per-server network assignment |
| IPv6 address + routes | Robot API `server_ipv6_net` | Per-server allocation from Hetzner |
| Provider ID (`hetzner-robot://ID`) | Robot API server ID | CAPI Machine↔Node matching |
| Secretbox encryption key | Cluster-level secret | CABPT generates per-Machine keys; all control planes must share one |
| Service account key | Cluster-level secret | Same; all control planes must share it for cross-node token validation |

CAPHR does not inject application config (CNI settings, runtime classes, workload config). That belongs in the TalosControlPlane or TalosConfigTemplate specs managed by CABPT/CACPPT.

Multi-Document Config

CABPT/CACPPT generates the base machineconfig (document 1). CAPHR appends hardware-specific config as additional YAML documents rather than modifying the base:

```yaml
# Document 1 — Generated by CABPT (untouched by CAPHR)
version: v1alpha1
machine:
  type: controlplane
  ...
---
# Document 2+ — Appended by CAPHR (storage volumes)
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GiB
---
apiVersion: v1alpha1
kind: RawVolumeConfig
name: osd-data
provisioning:
  diskSelector:
    match: system_disk
```

Hardware facts that modify the base document (MAC, hostname, VLAN, IPv6) are injected into document 1 via structured YAML manipulation — not string replacement. Each inject function is scoped to a single hardware fact and is independently testable.
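The append step itself is just a multi-document YAML join: the base document passes through byte-for-byte and new documents are added after a `---` separator. A minimal sketch (illustrative only; `appendDocuments` is a hypothetical helper, not CAPHR's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// appendDocuments joins the CABPT-generated base machineconfig with
// CAPHR's extra documents using the YAML document separator. The base
// document is passed through unchanged; only new documents are appended.
func appendDocuments(base string, extras ...string) string {
	docs := append([]string{strings.TrimRight(base, "\n")}, extras...)
	return strings.Join(docs, "\n---\n") + "\n"
}

func main() {
	base := "version: v1alpha1\nmachine:\n  type: controlplane"
	volume := "apiVersion: v1alpha1\nkind: VolumeConfig\nname: EPHEMERAL"
	fmt.Print(appendDocuments(base, volume))
}
```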

Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        CAPI Core                             │
│              (cluster lifecycle, machine lifecycle)           │
└────────────┬────────────────────────────┬────────────────────┘
             │                            │
┌────────────▼────────────┐  ┌────────────▼────────────────────┐
│  CABPT / CACPPT         │  │  CAPHR                          │
│  (Talos bootstrap +     │  │  (Hetzner Robot infrastructure) │
│   control plane)        │  │                                  │
│                         │  │  pkg/robot/    Hetzner Robot API │
│  Generates machineconfig│  │  pkg/sshrescue/  Rescue SSH ops  │
│  Stores in Secret       │  │  pkg/talos/    Talos gRPC + TLS │
│  Does NOT connect to    │  │  controllers/  State machine     │
│  any machine            │  │                                  │
└─────────────────────────┘  │  Reads CABPT Secret              │
                             │  Discovers hardware facts         │
                             │  Injects facts into config        │
                             │  Delivers config via Talos gRPC   │
                             └──────────────────────────────────┘
```

CABPT/CACPPT generates the machineconfig template and stores it as a Kubernetes Secret. It never connects to any machine — it is a config generator only.

CAPHR reads the bootstrap Secret, manages the hardware lifecycle via Hetzner Robot API, discovers hardware facts during rescue, injects them into the machineconfig, and delivers the final config to Talos via gRPC. It is both the hardware manager and the config delivery agent — a necessary coupling for bare metal where no metadata service exists.

Provisioning Flow

```
  HetznerRobotMachine created
          │
          ▼
  ┌─── ActivatingRescue ───┐
  │  Robot API: activate    │
  │  rescue mode            │
  │  Robot API: hw reset    │
  └──────────┬──────────────┘
             ▼
  ┌─── CheckRescueActive ──┐
  │  SSH probe: is rescue?  │
  │  (hostname=rescue OR    │
  │   /etc/hetzner-build)   │
  └──────────┬──────────────┘
             ▼
  ┌─── InRescue ────────────┐
  │  SSH: detect primary MAC│
  │  SSH: detect gateway IP │
  │  SSH: resolve install   │
  │    disk (by-id path)    │
  │  SSH: curl | dd image   │
  │  SSH: fix EFI boot order│
  │  Robot API: deactivate  │
  │    rescue               │
  │  Robot API: hw reset    │
  └──────────┬──────────────┘
             ▼
  ┌─── BootingTalos ────────┐
  │  Poll: TCP port 50000   │
  │  Verify: maintenance    │
  │    mode (gRPC probe)    │
  └──────────┬──────────────┘
             ▼
  ┌─── ApplyingConfig ──────┐
  │  Read CABPT Secret      │
  │  Inject: install disk   │
  │  Inject: VLAN + routes  │
  │  Inject: hostname       │
  │  Inject: IPv6           │
  │  Inject: cluster secrets│
  │  Inject: provider ID    │
  │  Append: VolumeConfig   │
  │  gRPC: ApplyConfig      │
  └──────────┬──────────────┘
             ▼
  ┌─── WaitingForBoot ──────┐
  │  Wait for Talos reboot  │
  │  Poll: K8s API (6443)   │
  └──────────┬──────────────┘
             ▼
  ┌─── Provisioned ─────────┐
  │  status.ready = true    │
  └─────────────────────────┘
```

Error Handling

  • Transient errors (SSH timeout, API 503, network): exponential backoff (15s × 2^n, capped at 5min), tracked via retryCount
  • Permanent errors (missing secret, invalid config): immediate StateError, no retry
  • Max 8 rescue boot retries: if rescue never activates, enter StateError
  • Stale Talos detection: if Talos boots in full mode (not maintenance), re-enter rescue

CRDs

HetznerRobotCluster

Cluster-level configuration.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotCluster
spec:
  controlPlaneEndpoint:
    host: "10.10.0.240"
    port: 6443
  robotSecretRef:
    name: hetzner-robot-credentials    # Keys: robot-user, robot-password
  sshSecretRef:
    name: hetzner-ssh-key              # Keys: ssh-privatekey, ssh-fingerprint
  dc: fsn1                             # Datacenter for hostname generation
  vlanConfig:                          # Optional internal network
    id: 4000
    interface: enp193s0f0np0
    prefixLength: 24
  talosSecretRef:                      # Optional shared cluster secrets
    name: talos-secrets                # Key: bundle (YAML)
  talosFactoryBaseURL: "https://factory.talos.dev"
```
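The `dc` field feeds hostname generation, and the Robot server ID feeds both the hostname and the provider ID (formats shown in the hardware-facts table above). A sketch of the naming scheme (hypothetical helpers, not CAPHR's actual functions; the `compute` role prefix is assumed from the host labels):

```go
package main

import "fmt"

// hostname composes role, datacenter, and Robot server ID into the
// per-server identity, e.g. compute-fsn1-2938104.
func hostname(role, dc string, serverID int) string {
	return fmt.Sprintf("%s-%s-%d", role, dc, serverID)
}

// providerID uses the hetzner-robot:// scheme that CAPI matches
// against Node provider IDs.
func providerID(serverID int) string {
	return fmt.Sprintf("hetzner-robot://%d", serverID)
}

func main() {
	fmt.Println(hostname("compute", "fsn1", 2938104)) // compute-fsn1-2938104
	fmt.Println(providerID(2938104))                  // hetzner-robot://2938104
}
```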

HetznerRobotHost

Permanent physical server inventory. One per Hetzner dedicated server. Never deleted when machines are removed.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotHost
metadata:
  name: node6
  labels:
    role: compute
spec:
  serverID: 2938104
  serverIP: ""           # Auto-detected from Robot API if empty
  serverIPv6Net: ""      # Auto-detected from Robot API if empty
  internalIP: 10.10.0.6  # VLAN IP (static assignment)
status:
  state: Available       # Available → Claimed → Provisioned → Deprovisioning
```

States: Available (in pool) → Claimed (assigned to a Machine) → Provisioned (running) → Deprovisioning (machine deleted, awaiting cleanup) → Available

HetznerRobotMachine

Per-machine provisioning config. Created by CAPI from TalosControlPlane or MachineDeployment.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HetznerRobotMachine
spec:
  hostRef:
    name: node6                  # Direct assignment (OR hostSelector for pool)
  talosSchematic: "3da7f440..."  # Talos factory schematic ID
  talosVersion: "v1.12.4"
  installDisk: "/dev/nvme0n1"   # Default, overridden by rescue detection
  ephemeralSize: "100GiB"       # Optional: limit EPHEMERAL, create OSD partition
```

HetznerRobotMachineTemplate

Template for MachineDeployment (worker scaling).

HetznerRobotRemediation / HetznerRobotRemediationTemplate

MachineHealthCheck integration. When a node is unhealthy, CAPHR issues a hardware reset via Robot API and waits for recovery.

```yaml
spec:
  strategy:
    type: Reboot
    retryLimit: 3       # Max hardware resets before giving up
    timeout: 300s       # Wait time after each reset
```

Package Structure

```
caphr/
├── api/v1alpha1/           CRD type definitions
│   ├── hetznerrobotcluster_types.go
│   ├── hetznerrobotmachine_types.go
│   ├── hetznerrobothost_types.go
│   └── hetznerrobotremediation_types.go
├── controllers/
│   ├── hetznerrobotmachine_controller.go   State machine + hardware fact injection
│   ├── hetznerrobothost_controller.go      Server inventory management
│   ├── hetznerrobotcluster_controller.go   Cluster lifecycle
│   └── hetznerrobotremediation_controller.go  MHC hardware reset
├── pkg/
│   ├── robot/              Hetzner Robot API client
│   │   ├── client.go       GetServer, ActivateRescue, ResetServer, etc.
│   │   └── from_secret.go  Credential loading from K8s Secret
│   ├── sshrescue/          Rescue mode SSH operations
│   │   └── client.go       MAC detection, disk detection, image install
│   └── talos/              Talos API integration
│       ├── client.go       gRPC connectivity, maintenance mode detection
│       └── bundle.go       Secret bundle parsing, admin TLS generation
├── examples/               Production cluster manifests
├── config/crd/             Generated CRD YAML
└── infra/                  Deployment manifests
```

Hetzner-Specific Workarounds

L2 Isolation (Static /32 Routing)

Hetzner DHCP assigns addresses from /25 or /26 prefixes. When two servers land in the same subnet, the kernel tries to reach the peer directly via ARP instead of routing through the gateway. Hetzner blocks direct L2 traffic between servers, so connectivity silently breaks.

Fix: Static /32 address + explicit on-link route for the gateway + default route via gateway. Forces all traffic through the router.
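The resulting Talos network config looks roughly like this (a sketch with placeholder addresses, not a verbatim CAPHR output):

```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 203.0.113.10/32         # public IP as /32: no on-link subnet, no ARP to peers
        routes:
          - network: 203.0.113.1/32 # gateway reachable on-link
          - network: 0.0.0.0/0      # default route via gateway
            gateway: 203.0.113.1
```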

NVMe Device Name Instability

NVMe device names (/dev/nvme0n1, nvme1n1) depend on PCI probe order, which differs between Hetzner rescue Linux and Talos. A disk that's nvme0n1 in rescue might be nvme1n1 in Talos.

Fix: Resolve to stable /dev/disk/by-id/ path during rescue. Use this path in the machineconfig.

EFI Boot Order

After Talos installation, PXE might still be first in the EFI boot order, causing the server to network-boot instead of booting Talos.

Fix: Use efibootmgr in rescue to set Talos first, PXE last, delete stale entries.

Ceph OSD Disk Detection

Storage servers may have existing Ceph BlueStore signatures on disks. Installing Talos on a Ceph OSD disk would destroy data.

Fix: lsblk + blkid check for ceph_bluestore signature. Refuse to install on disks with active Ceph data. ephemeralSize spec creates a separate OSD-ready partition via Talos VolumeConfig.
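The safety check reduces to scanning blkid-style output for the BlueStore signature before touching a disk. A minimal sketch (`hasCephSignature` is a hypothetical helper operating on captured command output, not CAPHR's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// hasCephSignature reports whether blkid-style output for a disk
// contains a Ceph BlueStore signature, in which case installation
// must be refused to avoid destroying OSD data.
func hasCephSignature(blkidOutput string) bool {
	return strings.Contains(blkidOutput, `TYPE="ceph_bluestore"`)
}

func main() {
	out := `/dev/nvme1n1: TYPE="ceph_bluestore"`
	if hasCephSignature(out) {
		fmt.Println("refusing to install: active Ceph data on disk")
	}
}
```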

Comparison with Alternatives

| | CAPHR | CAPH (syself) | Sidero Metal |
|---|---|---|---|
| Target OS | Talos | Ubuntu/Debian (kubeadm) | Talos |
| Bare metal | Hetzner Robot | Hetzner Robot + Cloud | Any (IPMI/PXE) |
| Provisioning | Rescue SSH → dd → Talos gRPC | Rescue SSH → installimage → cloud-init | PXE → Agent → Talos API |
| Hardware discovery | SSH in rescue mode | SSH in rescue mode | Agent on booted machine |
| Config delivery | Direct gRPC push | SSH + cloud-init | Agent pull |
| OS-agnostic | No (Talos-specific) | Yes (SSH + cloud-init) | No (Talos-specific) |
| Status | Production (Sylphx) | GA v1.0+ | Deprecated (→ Omni) |

Status

Production — running the Sylphx Platform infrastructure: 3 control plane + 3 worker nodes on Hetzner AX-series dedicated servers.

License

Apache 2.0
