fix(logical-backup): wait for PG connectivity before running backup#3069

Open
aslafy-z wants to merge 2 commits into zalando:master from aslafy-z:fix/logical-backup-wait-for-connectivity

Conversation

@aslafy-z

Problem description

The logical-backup script (dump.sh) connects to the target PostgreSQL pod immediately after resolving its IP via the Kubernetes API. When NetworkPolicy is enforced via iptables, a newly-created pod's IP may not yet be present in the destination node's ingress allow lists, causing cross-node connections to be rejected until the next policy sync.

This manifests as intermittent Connection refused errors even though the target PostgreSQL pod is healthy and listening:

psql: error: connection to server at "x.x.x.x", port 5432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

The rejection happens at the network layer on the destination node, not at PostgreSQL. The race window is typically under 1 second but is enough to cause consistent failures because dump.sh connects with zero delay after pod startup.

This PR adds a pg_isready retry loop before the dump starts. It returns as soon as the connection succeeds, adding near-zero overhead when connectivity is immediate. Retry count and delay are configurable via LOGICAL_BACKUP_CONNECT_RETRIES (default: 10) and LOGICAL_BACKUP_CONNECT_RETRY_DELAY (default: 2s), compatible with logical_backup_cronjob_environment_secret.
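The retry loop described above can be sketched as follows. This is a minimal illustration of the approach, not the exact diff: the helper name `wait_for_pg` and the use of `PGHOST`/`PGPORT` are assumptions for the example; dump.sh would invoke the check (e.g. `wait_for_pg || exit 1`) before starting the dump.

```shell
#!/bin/sh
# Sketch of a pg_isready retry loop with the PR's env var defaults.
# RETRIES/DELAY fall back to 10 attempts / 2s when the vars are unset.
RETRIES="${LOGICAL_BACKUP_CONNECT_RETRIES:-10}"
DELAY="${LOGICAL_BACKUP_CONNECT_RETRY_DELAY:-2}"

wait_for_pg() {
    i=0
    while [ "$i" -lt "$RETRIES" ]; do
        # pg_isready exits 0 as soon as the server accepts connections,
        # so the loop adds near-zero overhead when connectivity is immediate.
        if pg_isready -h "$PGHOST" -p "${PGPORT:-5432}" -t 3 >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$DELAY"
    done
    echo "PostgreSQL at $PGHOST not reachable after $RETRIES attempts" >&2
    return 1
}
```

Because `pg_isready` returns success on the first attempt when the pod is already reachable, the happy path costs a single probe.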

Linked issues

Checklist

  • Your go code is formatted. Your IDE should do it automatically for you.
    • N/A: shell script change only
  • You have updated generated code when introducing new fields to the acid.zalan.do api package.
    • N/A: no API changes
  • New configuration options are reflected in CRD validation, helm charts and sample manifests.
    • N/A: env vars are optional with defaults, passed through existing logical_backup_cronjob_environment_secret
  • New functionality is covered by unit and/or e2e tests.
    • Tested locally with Docker PostgreSQL: immediate return when PG reachable, correct failure after retries, correct retry on delayed PG start, correct handling of custom env vars
  • You have checked existing open PRs for possible overlap and referenced them.
    • No overlapping PRs found
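As the checklist notes, the new variables ride through the existing `logical_backup_cronjob_environment_secret` mechanism rather than new operator configuration. As a hypothetical illustration (the secret name `logical-backup-env` and the namespace are placeholders, not values from this PR), tuning the retry behavior could look like:

```shell
# Hypothetical: populate the secret referenced by
# logical_backup_cronjob_environment_secret with tuned retry settings.
kubectl create secret generic logical-backup-env \
  --namespace default \
  --from-literal=LOGICAL_BACKUP_CONNECT_RETRIES=20 \
  --from-literal=LOGICAL_BACKUP_CONNECT_RETRY_DELAY=5 \
  --dry-run=client -o yaml | kubectl apply -f -
```

When the secret is absent or the keys are unset, the script falls back to the documented defaults (10 retries, 2s delay).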

The backup script connects to the target PostgreSQL pod immediately
after resolving its IP via the Kubernetes API. When NetworkPolicy is
enforced via iptables, a newly-created pod's IP may not yet be present
in the destination node's ingress allow lists, causing cross-node
connections to be rejected until the next policy sync.

This adds a pg_isready retry loop before the dump starts, with
configurable retries and delay via LOGICAL_BACKUP_CONNECT_RETRIES
(default: 10) and LOGICAL_BACKUP_CONNECT_RETRY_DELAY (default: 2s).

Signed-off-by: Zadkiel AHARONIAN <zaharonian@ccl-consulting.fr>
@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

Document the new environment variables that control the pg_isready
retry loop added in the previous commit. These are passed via the
existing logical_backup_cronjob_environment_secret mechanism.

Signed-off-by: Zadkiel AHARONIAN <zaharonian@ccl-consulting.fr>
@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.
