fix(logical-backup): wait for PG connectivity before running backup#3069

Open
aslafy-z wants to merge 2 commits into zalando:master from aslafy-z:fix/logical-backup-wait-for-connectivity

Conversation

@aslafy-z

Problem description

The logical-backup script (dump.sh) connects to the target PostgreSQL pod immediately after resolving its IP via the Kubernetes API. When NetworkPolicy is enforced via iptables, a newly-created pod's IP may not yet be present in the destination node's ingress allow lists, causing cross-node connections to be rejected until the next policy sync.

This manifests as intermittent Connection refused errors even though the target PostgreSQL pod is healthy and listening:

psql: error: connection to server at "x.x.x.x", port 5432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

The rejection happens at the network layer on the destination node, not at PostgreSQL. The race window is typically under 1 second but is enough to cause consistent failures because dump.sh connects with zero delay after pod startup.

This PR adds a pg_isready retry loop before the dump starts. It returns as soon as the connection succeeds, adding near-zero overhead when connectivity is immediate. Retry count and delay are configurable via LOGICAL_BACKUP_CONNECT_RETRIES (default: 10) and LOGICAL_BACKUP_CONNECT_RETRY_DELAY (default: 2s), compatible with logical_backup_cronjob_environment_secret.
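The retry loop described above can be sketched as follows. This is a minimal illustration of the approach, not the exact diff: the helper name `wait_for_pg` and the use of `PGHOST`/`PGPORT` are assumptions for the example; dump.sh would invoke the check (e.g. `wait_for_pg || exit 1`) before starting the dump.

```shell
#!/bin/sh
# Sketch of a pg_isready retry loop with the PR's env var defaults.
# RETRIES/DELAY fall back to 10 attempts / 2s when the vars are unset.
RETRIES="${LOGICAL_BACKUP_CONNECT_RETRIES:-10}"
DELAY="${LOGICAL_BACKUP_CONNECT_RETRY_DELAY:-2}"

wait_for_pg() {
    i=0
    while [ "$i" -lt "$RETRIES" ]; do
        # pg_isready exits 0 as soon as the server accepts connections,
        # so the loop adds near-zero overhead when connectivity is immediate.
        if pg_isready -h "$PGHOST" -p "${PGPORT:-5432}" -t 3 >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$DELAY"
    done
    echo "PostgreSQL at $PGHOST not reachable after $RETRIES attempts" >&2
    return 1
}
```

Because `pg_isready` returns success on the first attempt when the pod is already reachable, the happy path costs a single probe.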

Linked issues

Checklist

  • Your go code is formatted. Your IDE should do it automatically for you.
    • N/A: shell script change only
  • You have updated generated code when introducing new fields to the acid.zalan.do api package.
    • N/A: no API changes
  • New configuration options are reflected in CRD validation, helm charts and sample manifests.
    • N/A: env vars are optional with defaults, passed through existing logical_backup_cronjob_environment_secret
  • New functionality is covered by unit and/or e2e tests.
    • Tested locally with Docker PostgreSQL: immediate return when PG reachable, correct failure after retries, correct retry on delayed PG start, correct handling of custom env vars
  • You have checked existing open PRs for possible overlap and referenced them.
    • No overlapping PRs found
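As the checklist notes, the new variables ride through the existing `logical_backup_cronjob_environment_secret` mechanism rather than new operator configuration. As a hypothetical illustration (the secret name `logical-backup-env` and the namespace are placeholders, not values from this PR), tuning the retry behavior could look like:

```shell
# Hypothetical: populate the secret referenced by
# logical_backup_cronjob_environment_secret with tuned retry settings.
kubectl create secret generic logical-backup-env \
  --namespace default \
  --from-literal=LOGICAL_BACKUP_CONNECT_RETRIES=20 \
  --from-literal=LOGICAL_BACKUP_CONNECT_RETRY_DELAY=5 \
  --dry-run=client -o yaml | kubectl apply -f -
```

When the secret is absent or the keys are unset, the script falls back to the documented defaults (10 retries, 2s delay).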

The backup script connects to the target PostgreSQL pod immediately
after resolving its IP via the Kubernetes API. When NetworkPolicy is
enforced via iptables, a newly-created pod's IP may not yet be present
in the destination node's ingress allow lists, causing cross-node
connections to be rejected until the next policy sync.

This adds a pg_isready retry loop before the dump starts, with
configurable retries and delay via LOGICAL_BACKUP_CONNECT_RETRIES
(default: 10) and LOGICAL_BACKUP_CONNECT_RETRY_DELAY (default: 2s).

Signed-off-by: Zadkiel AHARONIAN <zaharonian@ccl-consulting.fr>
@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

Document the new environment variables that control the pg_isready
retry loop added in the previous commit. These are passed via the
existing logical_backup_cronjob_environment_secret mechanism.

Signed-off-by: Zadkiel AHARONIAN <zaharonian@ccl-consulting.fr>
@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.
