Fix for CBDB expand with concurrent distributed tx #1801

Merged
reshke merged 1 commit into apache:REL_2_STABLE from reshke:REL_2_STABLE on Jun 4, 2026

@reshke reshke commented Jun 3, 2026

fixes #1800

gpexpand currently fails when the coordinator WAL contains a distributed transaction, because the new segment is created as a copy of the coordinator datadir.

Skip distributed transaction redo on the first startup after gpexpand.

A newly created segment may contain records from the primary in its xlog, so skip them.
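
The idea described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the patch itself: the type `DtxRedoState`, the enum `RedoResult`, the function `redo_distributed_commit`, and the choice of "not a coordinator" as the skip condition are all hypothetical names and simplifications introduced here. The one detail taken from the report is that the committed-gid array has a limit of 0 on the segment, so replaying an inherited DISTRIBUTED_COMMIT record cannot succeed there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_GXIDS 8

/* Hypothetical outcome of replaying one DISTRIBUTED_COMMIT record. */
typedef enum { REDO_APPLIED, REDO_SKIPPED, REDO_LIMIT_REACHED } RedoResult;

/* Toy stand-in for the committed-gid bookkeeping used during recovery. */
typedef struct {
    bool     is_coordinator;       /* assumption: false on an expanded segment */
    size_t   capacity;             /* 0 on a segment, as in the "limit of 0" error */
    size_t   count;                /* number of gxids recorded so far */
    unsigned gxids[TOY_MAX_GXIDS]; /* committed distributed transaction ids */
} DtxRedoState;

/* Replay one distributed commit record, skipping it where it cannot apply. */
RedoResult redo_distributed_commit(DtxRedoState *state, unsigned gxid)
{
    assert(state != NULL);
    assert(state->capacity <= TOY_MAX_GXIDS);

    /* Guard (assumed form): records inherited from the copied datadir
     * carry no meaning on a segment, so ignore them instead of failing. */
    if (!state->is_coordinator)
        return REDO_SKIPPED;

    /* Without the guard, a capacity of 0 would end up here on the very
     * first record. The real code reports a FATAL error at this point. */
    if (state->count >= state->capacity)
        return REDO_LIMIT_REACHED;

    state->gxids[state->count++] = gxid;
    return REDO_APPLIED;
}
```

In this model, replaying gxid 191 on a state with `is_coordinator = false` and `capacity = 0` returns `REDO_SKIPPED` and leaves the array untouched, whereas the unguarded path would hit the limit check immediately.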


reshke commented Jun 3, 2026

Will recheck for the main branch too, soon.

@reshke reshke changed the title Fix for CDBD expand with concurrent distributed tx Fix for CBDB expand with concurrent distributed tx Jun 3, 2026
@reshke reshke requested review from my-ship-it and yjhjstz June 3, 2026 15:18

@my-ship-it my-ship-it left a comment

Good catch! LGTM

Comment thread src/backend/cdb/cdbdtxrecovery.c
@reshke reshke requested a review from jiaqizho June 3, 2026 16:10

reshke commented Jun 3, 2026

Looks like this was already fixed in fc8aab8, but then re-introduced


reshke commented Jun 3, 2026

> Looks like this was already fixed in fc8aab8, but then re-introduced

Quite interesting: this does not reproduce for me on master without additional CHECKPOINT spam, but it does reproduce on REL_2_STABLE.

16.9 repro:

reshke@yezzey-cbdb-bench:~/cloudberry$ /usr/local/gpdb/bin/postgres --single postgres  -D /home/reshke/cloudberry/gpAux/gpdemo/datadirs/demoDataDir3
2026-06-03 20:44:28.477343 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","database system was interrupted while in recovery at 2026-06-03 20:44:16 UTC",,"This probably means that some data is corrupted and you will have to use the last backup for recovery.",,,,,,"StartupXLOG","xlog.c",5380,
2026-06-03 20:44:28.477589 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,,"SyncAllXLogFiles","fd.c",3673,
2026-06-03 20:44:28.477746 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,,"SyncAllXLogFiles","fd.c",3675,
2026-06-03 20:44:28.477848 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","restarting backup recovery with redo LSN 0/10000110",,,,,,,,"InitWalRecovery","xlogrecovery.c",794,
2026-06-03 20:44:28.478354 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,,"InitWalRecovery","xlogrecovery.c",943,
2026-06-03 20:44:28.481791 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"LOG","00000","redo starts at 0/10000110",,,,,,,,"PerformWalRecovery","xlogrecovery.c",1760,
2026-06-03 20:44:28.481927 UTC,,,p1083267,th-1554549696,,,,0,,,seg3,,,,,"FATAL","XX000","the limit of 0 distributed transactions has been reached while adding gid = 191. Committed gid array length: 0, dump:
","It should not happen. Temporarily increase max_connections (need postmaster reboot) on the postgres (master or standby) to work around this issue and then report a bug",,,,"WAL redo at 0/10000110 for Transaction/DISTRIBUTED_COMMIT: distributed commit 2026-06-03 20:43:37.46535+00 gxid = 191",,,"redoDistributedCommitRecord","cdbdtxrecovery.c",529,1    0x60f1761b9e56 postgres errstart + 0x286
2    0x60f175aed5b7 postgres <symbol not found> + 0x75aed5b7
3    0x60f175bcaa43 postgres xact_redo + 0x293
4    0x60f175be415a postgres PerformWalRecovery + 0x3fa
5    0x60f175bd5584 postgres StartupXLOG + 0x364
6    0x60f1761ce53a postgres InitPostgres + 0x26a
7    0x60f176017041 postgres PostgresMain + 0x101
8    0x60f1760199ce postgres PostgresSingleUserMain + 0xfe
9    0x60f175aee7c5 postgres main + 0x605
10   0x75fca362a1ca libc.so.6 <symbol not found> + 0xa362a1ca
11   0x75fca362a28b libc.so.6 __libc_start_main + 0x8b
12   0x60f175af23d5 postgres _start + 0x25

reshke@yezzey-cbdb-bench:~/cloudberry$ /usr/local/gpdb/bin/postgres --version
postgres (Apache Cloudberry) 16.9


reshke commented Jun 3, 2026

same fix for main #1803

A duct-tape fix for this was already added in fc8aab8, though the redo
path was not patched there. Copy the same logic into the
redoDistributedCommitRecord function body.

reshke commented Jun 3, 2026

> Looks like this was already fixed in fc8aab8, but then re-introduced

Looks like the fix was incomplete back then too, but I didn't check.

@reshke reshke merged commit 0fd99be into apache:REL_2_STABLE Jun 4, 2026
45 checks passed
reshke added a commit to open-gpdb/cloudberry that referenced this pull request Jun 4, 2026
A duct-tape fix for this was already added in fc8aab8, though the redo
path was not patched there. Copy the same guard into the
redoDistributedCommitRecord function body.