Fix for CBDB expand with concurrent distributed tx by reshke · Pull Request #1801 · apache/cloudberry

reshke · 2026-06-03T15:06:46Z

gpexpand currently fails when there is a distributed transaction in the coordinator WAL, because the new segment is created as a copy of the coordinator data directory.

Skip distributed transaction redo on the first startup after gpexpand.

A newly created segment may contain records from the primary in its xlog, so skip them.
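A minimal C sketch of the guard described above, assuming a simplified model of redo: a segment keeps a fixed-capacity array of committed distributed transaction ids (gxids), and a hypothetical `first_startup_after_expand` flag marks the first startup of a segment created by gpexpand. All names here (`RedoState`, `redo_distributed_commit`, the flag) are illustrative and do not come from the Cloudberry source; the actual change is in `redoDistributedCommitRecord` (`cdbdtxrecovery.c`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a segment's committed-gxid tracking during WAL redo.
 * None of these names come from the Cloudberry source. */
typedef unsigned int DistributedXid;

typedef struct RedoState
{
    bool first_startup_after_expand; /* hypothetical: set on a segment freshly created by gpexpand */
    DistributedXid *committed;       /* committed gxid array */
    size_t capacity;                 /* maximum entries (0 on the failing segment in the repro) */
    size_t length;                   /* entries currently stored */
} RedoState;

typedef enum RedoResult
{
    REDO_SKIPPED,  /* record ignored by the guard */
    REDO_APPLIED,  /* gxid appended to the array */
    REDO_OVERFLOW  /* the real server would raise FATAL here */
} RedoResult;

/* Redo one DISTRIBUTED_COMMIT record, skipping it entirely when the
 * segment is starting for the first time after gpexpand. */
RedoResult redo_distributed_commit(RedoState *st, DistributedXid gxid)
{
    assert(st != NULL);

    /* The guard: records that came along with the copied coordinator
     * data directory are not tracked on the new segment. */
    if (st->first_startup_after_expand)
        return REDO_SKIPPED;

    /* Without the guard, a zero-capacity array overflows on the first gxid. */
    if (st->length >= st->capacity)
        return REDO_OVERFLOW;

    st->committed[st->length++] = gxid;
    return REDO_APPLIED;
}
```

With the flag set, a DISTRIBUTED_COMMIT record for gxid 191 is skipped even when the capacity is 0; with the flag cleared, the same record overflows, which corresponds to the FATAL in the 16.9 repro in this thread.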

reshke · 2026-06-03T15:11:12Z

Will recheck on the main branch too, soon.

my-ship-it

Good catch! LGTM

reshke · 2026-06-03T19:03:04Z

Looks like this was already fixed in fc8aab8, but then re-introduced

reshke · 2026-06-03T20:45:10Z

> Looks like this was already fixed in fc8aab8, but then re-introduced

Quite interesting: this does not reproduce for me on master without additional CHECKPOINT spam, but it does reproduce on REL_2_STABLE.

16.9 repro:

reshke@yezzey-cbdb-bench:~/cloudberry$ /usr/local/gpdb/bin/postgres --single postgres  -D /home/reshke/cloudberry/gpAux/gpdemo/datadirs/demoDataDir3
2026-06-03 20:44:28.477343 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","database system was interrupted while in recovery at 2026-06-03 20:44:16 UTC",,"This probably means that some data is corrupted and you will have to use the last backup for recovery.",,,,,,"StartupXLOG","xlog.c",5380,
2026-06-03 20:44:28.477589 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,,"SyncAllXLogFiles","fd.c",3673,
2026-06-03 20:44:28.477746 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,,"SyncAllXLogFiles","fd.c",3675,
2026-06-03 20:44:28.477848 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","restarting backup recovery with redo LSN 0/10000110",,,,,,,,"InitWalRecovery","xlogrecovery.c",794,
2026-06-03 20:44:28.478354 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,,"InitWalRecovery","xlogrecovery.c",943,
2026-06-03 20:44:28.481791 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","redo starts at 0/10000110",,,,,,,,"PerformWalRecovery","xlogrecovery.c",1760,
2026-06-03 20:44:28.481927 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"FATAL","XX000","the limit of 0 distributed transactions has been reached while adding gid = 191. Committed gid array length: 0, dump:
","It should not happen. Temporarily increase max_connections (need postmaster reboot) on the postgres (master or standby) to work around this issue and then report a bug",,,,"WAL redo at 0/10000110 for Transaction/DISTRIBUTED_COMMIT: distributed commit 2026-06-03 20:43:37.46535+00 gxid = 191",,,"redoDistributedCommitRecord","cdbdtxrecovery.c",529,1    0x60f1761b9e56 postgres errstart + 0x286
2    0x60f175aed5b7 postgres <symbol not found> + 0x75aed5b7
3    0x60f175bcaa43 postgres xact_redo + 0x293
4    0x60f175be415a postgres PerformWalRecovery + 0x3fa
5    0x60f175bd5584 postgres StartupXLOG + 0x364
6    0x60f1761ce53a postgres InitPostgres + 0x26a
7    0x60f176017041 postgres PostgresMain + 0x101
8    0x60f1760199ce postgres PostgresSingleUserMain + 0xfe
9    0x60f175aee7c5 postgres main + 0x605
10   0x75fca362a1ca libc.so.6 <symbol not found> + 0xa362a1ca
11   0x75fca362a28b libc.so.6 __libc_start_main + 0x8b
12   0x60f175af23d5 postgres _start + 0x25

reshke@yezzey-cbdb-bench:~/cloudberry$ /usr/local/gpdb/bin/postgres --version
postgres (Apache Cloudberry) 16.9
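The FATAL above comes from a bounded append: redo tries to record committed gid 191 in an array whose limit is 0, so even the first entry overflows. A hypothetical sketch of that check follows; `add_committed_gxid` and its parameters are illustrative, and only the message text is taken from the log (the trailing `dump:` section is omitted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bounded append modeled on the FATAL in the log.
 * Returns false and fills errbuf when the array is already full. */
bool add_committed_gxid(unsigned int *arr, int max_gxids, int *len,
                        unsigned int gxid, char *errbuf, size_t errlen)
{
    assert(len != NULL && errbuf != NULL && errlen > 0);

    if (*len >= max_gxids)
    {
        /* With max_gxids == 0 even the very first gxid overflows. */
        snprintf(errbuf, errlen,
                 "the limit of %d distributed transactions has been reached "
                 "while adding gid = %u. Committed gid array length: %d",
                 max_gxids, gxid, *len);
        return false;
    }

    arr[(*len)++] = gxid;
    errbuf[0] = '\0'; /* no error */
    return true;
}
```

Calling it with `max_gxids == 0` and `gxid == 191` yields the same message as the log line, which is why the new segment cannot replay the coordinator's DISTRIBUTED_COMMIT record without a guard.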

reshke · 2026-06-03T20:52:05Z

Same fix for main: #1803

A duct-tape fix for this was already added in fc8aab8, though the redo path was not patched there. Copy the same logic into the body of the redoDistributedCommitRecord function.

reshke · 2026-06-03T21:13:52Z

> Looks like this was already fixed in fc8aab8, but then re-introduced

Looks like the fix was incomplete back then too, but I didn't check.

A duct-tape fix for this was already added in fc8aab8, though the redo path was not patched there. Copy the same guard into the body of the redoDistributedCommitRecord function.

reshke changed the title from "Fix for CDBD expand with concurrent distributed tx" to "Fix for CBDB expand with concurrent distributed tx" Jun 3, 2026

reshke requested review from my-ship-it and yjhjstz June 3, 2026 15:18

my-ship-it approved these changes Jun 3, 2026

vovik0134 reviewed Jun 3, 2026

Comment thread src/backend/cdb/cdbdtxrecovery.c

x4m approved these changes Jun 3, 2026

reshke requested a review from jiaqizho June 3, 2026 16:10

reshke mentioned this pull request Jun 3, 2026

Fix for CDBD expand with concurrent distributed tx #1803

Merged

Fix for CDBD expand with concurrent distributed tx

096bbba

A duct-tape fix for this was already added in fc8aab8, though the redo path was not patched there. Copy the same logic into the body of the redoDistributedCommitRecord function.

reshke force-pushed the REL_2_STABLE branch from 6a31586 to 096bbba Compare June 3, 2026 21:00

reshke merged commit 0fd99be into apache:REL_2_STABLE Jun 4, 2026
45 checks passed

reshke merged 1 commit into apache:REL_2_STABLE from reshke:REL_2_STABLE

4 participants
