Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18 #1799

Open
kongfanshen-0801 wants to merge 3 commits into apache:main from kongfanshen-0801:feature/right-semi-join-backport

@kongfanshen-0801
Contributor

What

Backports two upstream PostgreSQL commits that add Hash Right Semi Join:

  • aa86129e1 — Support "Right Semi Join" plan shapes
  • 5668a857d — Fix right-semi-joins in HashJoin rescans

This lets the planner build the hash table on the LHS of an IN/EXISTS semijoin
when that side is smaller, instead of always hashing the RHS (subquery) relation.

Cloudberry/GPDB-specific adaptations

  • nodes.h: JOIN_RIGHT_SEMI is appended at the end of the JoinType
    enum rather than in upstream's mid-list position. Inserting it mid-list
    shifts the integer values of the GPDB-only JOIN_DEDUP_SEMI/_REVERSE and
    JOIN_UNIQUE_* codes, which corrupts MPP motion planning and crashes during
    dispatch (SIGSEGV in setupCdbProcessList). Appending keeps every existing
    value stable.
  • cdbpath.c (cdbpath_motion_for_join, serial + parallel switches):
    handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI — the inner
    (build) side must not be replicated, since a right-semi join emits
    build-side rows.
  • joinrels.c: add the JOIN_RIGHT_SEMI path alongside JOIN_SEMI while
    preserving the existing GPDB JOIN_DEDUP_SEMI handling.
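
To illustrate the nodes.h point, here is a minimal sketch of why appending is safer than inserting mid-list. The enum names and members are made up and shortened for illustration; they are not the real JoinType layout in either tree.

```c
#include <assert.h>

/* Made-up, shortened enums for illustration; not the real JoinType layout. */

/* Before the backport: the last member stands in for a GPDB-only code. */
enum OldJoinType { OLD_SEMI, OLD_ANTI, OLD_RIGHT_ANTI, OLD_DEDUP_SEMI };

/* New member inserted mid-list: every later member shifts by one. */
enum MidJoinType { MID_SEMI, MID_ANTI, MID_RIGHT_SEMI, MID_RIGHT_ANTI,
                   MID_DEDUP_SEMI };

/* New member appended: every pre-existing value is unchanged. */
enum EndJoinType { END_SEMI, END_ANTI, END_RIGHT_ANTI, END_DEDUP_SEMI,
                   END_RIGHT_SEMI };

/* Compile-time checks of the two layouts. */
static_assert((int) MID_DEDUP_SEMI == (int) OLD_DEDUP_SEMI + 1,
              "mid-list insertion shifts the later codes");
static_assert((int) END_DEDUP_SEMI == (int) OLD_DEDUP_SEMI,
              "appending keeps the existing codes stable");

/* Returns 1 when a code keeps its integer value across two layouts. */
int value_is_stable(int old_value, int new_value)
{
    return old_value == new_value;
}
```

Any code that depends on the old integer values (tables indexed by join type, serialized plans, range comparisons) keeps working with the appended layout, which is the property this PR relies on.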

Testing

  • Functional: Hash Right Semi Join is chosen for small-build-side semijoins;
    results verified correct (dedup semantics, rescan correctness, MPP execution
    across segments).
  • Regression: the join test expected output is still being reconciled
    against CI's canonical environment — hence this PR is opened as a draft.

Notes

  • Only affects the PostgreSQL planner (optimizer=off); GPORCA does not
    generate JOIN_RIGHT_SEMI.

@kongfanshen-0801 kongfanshen-0801 marked this pull request as ready for review June 3, 2026 08:26
@my-ship-it
Contributor

Thanks for the backport. Could we keep the original commit history? Thanks.

Richard Guo and others added 3 commits June 4, 2026 10:01
Support "Right Semi Join" plan shapes

Hash joins can support semijoin with the LHS input on the right, using
the existing logic for inner join, combined with the assurance that only
the first match for each inner tuple is considered, which can be
achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag.  This can be very
useful in some cases since we may now have the option to hash the
smaller table instead of the larger.

Merge join could likely support "Right Semi Join" too.  However, the
benefit of swapping inputs tends to be small here, so we do not address
that in this patch.

Note that this patch also modifies a test query in join.sql to ensure it
continues testing as intended.  With this patch the original query would
result in a right-semi-join rather than semi-join, compromising its
original purpose of testing the fix for neqjoinsel's behavior for
semi-joins.

Author: Richard Guo
Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li
Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com
(cherry picked from commit aa86129e19d704afb93cb84ab9638f33d266ee9d)
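
The idea in the commit message above can be sketched as a toy model: hash the semijoin's LHS, probe with the RHS, and emit each LHS row only on its first match. This is a simplified illustration, not the executor code; `BuildTuple`, `HashTable`, `build`, and `probe_right_semi` are hypothetical names, and the `matched` field only plays the role that HEAP_TUPLE_HAS_MATCH plays in the real hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS 8

/* One row of the build (hashed) side, i.e. the semijoin's LHS. */
typedef struct BuildTuple
{
    int key;
    bool matched;               /* stands in for HEAP_TUPLE_HAS_MATCH */
    struct BuildTuple *next;    /* bucket chain */
} BuildTuple;

typedef struct HashTable
{
    BuildTuple *buckets[NBUCKETS];
} HashTable;

static unsigned bucket_of(int key)
{
    return (unsigned) key % NBUCKETS;
}

/* Build phase: hash the LHS rows and clear their match flags. */
void build(HashTable *ht, BuildTuple *tuples, size_t n)
{
    assert(ht != NULL);
    for (size_t i = 0; i < NBUCKETS; i++)
        ht->buckets[i] = NULL;
    for (size_t i = 0; i < n; i++)
    {
        unsigned b = bucket_of(tuples[i].key);

        tuples[i].matched = false;
        tuples[i].next = ht->buckets[b];
        ht->buckets[b] = &tuples[i];
    }
}

/*
 * Probe phase: scan the RHS keys and emit each LHS row on its first match
 * only. "out" must have room for every build tuple. Returns rows emitted.
 */
size_t probe_right_semi(HashTable *ht, const int *probe_keys, size_t nprobe,
                        int *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < nprobe; i++)
    {
        BuildTuple *t = ht->buckets[bucket_of(probe_keys[i])];

        for (; t != NULL; t = t->next)
        {
            if (t->key == probe_keys[i] && !t->matched)
            {
                t->matched = true;      /* later matches are skipped */
                out[nout++] = t->key;
            }
        }
    }
    return nout;
}
```

With LHS keys {1, 2, 2, 3} and RHS keys {2, 2, 5, 3, 2}, the sketch emits three rows (both 2s and the 3), no matter how many times a key repeats on the RHS.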
Fix right-semi-joins in HashJoin rescans

When resetting a HashJoin node for rescans, if it is a single-batch
join and there are no parameter changes for the inner subnode, we can
just reuse the existing hash table without rebuilding it.  However,
for join types that depend on the inner-tuple match flags in the hash
table, we need to reset these match flags to avoid incorrect results.
This applies to right, right-anti, right-semi, and full joins.

When I introduced "Right Semi Join" plan shapes in aa86129e1, I failed
to reset the match flags in the hash table for right-semi joins in
rescans.  This oversight has been shown to produce incorrect results.
This patch fixes it.

Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com
(cherry picked from commit 5668a857de4f3f12066b2bbc626b77be4fc95ee5)
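
The rescan rule described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual rescan code: the `Sketch*` types, `uses_match_flags`, and `rescan` are hypothetical names, and a flat array stands in for the hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up join-type codes for this sketch only. */
typedef enum SketchJoinType
{
    SK_INNER, SK_SEMI, SK_RIGHT, SK_RIGHT_ANTI, SK_RIGHT_SEMI, SK_FULL
} SketchJoinType;

typedef struct SketchTuple
{
    int key;
    bool matched;               /* inner-tuple match flag */
} SketchTuple;

typedef struct SketchHashJoin
{
    SketchJoinType jointype;
    SketchTuple *tuples;        /* stands in for the hash table contents */
    size_t ntuples;
    bool single_batch;
} SketchHashJoin;

/* Join types whose results depend on the inner-tuple match flags. */
bool uses_match_flags(SketchJoinType jt)
{
    switch (jt)
    {
        case SK_RIGHT:
        case SK_RIGHT_ANTI:
        case SK_RIGHT_SEMI:
        case SK_FULL:
            return true;
        default:
            return false;
    }
}

/*
 * Returns true when the existing hash table is reused, false when the
 * caller must rebuild it. On reuse, stale match flags are cleared for the
 * join types that depend on them.
 */
bool rescan(SketchHashJoin *hj, bool inner_params_changed)
{
    assert(hj != NULL);
    if (!hj->single_batch || inner_params_changed)
        return false;
    if (uses_match_flags(hj->jointype))
    {
        for (size_t i = 0; i < hj->ntuples; i++)
            hj->tuples[i].matched = false;
    }
    return true;
}
```

If the flags were left set for a right-semi join, inner rows matched in the previous scan would be skipped in the next one, which is the kind of incorrect result the fix addresses.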
This commit carries the Cloudberry/Greenplum-specific changes needed on
top of the two cherry-picked upstream commits (aa86129e1, 5668a857d),
which only touch the upstream PostgreSQL planner/executor files.

- nodes.h: move JOIN_RIGHT_SEMI to the END of the JoinType enum. Upstream
  places it next to JOIN_RIGHT_ANTI, but in the Cloudberry tree that shifts
  the integer values of the GPDB-specific JOIN_DEDUP_SEMI/REVERSE and
  JOIN_UNIQUE_* codes. Value-dependent code then corrupts MPP motion
  planning, producing a degenerate plan ("Gather Motion 0:1" /
  "Redistribute Motion 1:0") that crashes with SIGSEGV in
  setupCdbProcessList() during dispatch. Appending keeps every pre-existing
  enum value stable.

- cdbpath.c (cdbpath_motion_for_join, both the serial and parallel switch):
  handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI. A right-semi join
  emits inner (build-side) rows, so the inner side must not be replicated,
  otherwise matched inner rows could be emitted more than once. Without this
  the new join type would hit the switch default and elog(ERROR,
  "unexpected join type") at plan time.

Note: this feature is only exercised by the PostgreSQL planner
(optimizer=off); GPORCA does not generate JOIN_RIGHT_SEMI.
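
As a rough illustration of the cdbpath.c change described above, the sketch below models a switch that forbids replicating the inner side for the right-join family, plus a small counter showing why replication would duplicate rows. All names and codes here are hypothetical; the real cdbpath_motion_for_join logic is more involved, and the counting model assumes matching outer rows are co-located with the inner row when the inner side is not replicated.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up join-type codes for this sketch only. */
typedef enum SketchJoinType
{
    SK_JOIN_INNER,
    SK_JOIN_RIGHT,
    SK_JOIN_RIGHT_ANTI,
    SK_JOIN_RIGHT_SEMI,
    SK_JOIN_UNHANDLED           /* a code the switch does not know about */
} SketchJoinType;

/*
 * Returns 1 if the inner side may be replicated, 0 if it must not be, and
 * -1 for an unhandled join type (standing in for the elog(ERROR) default).
 */
int inner_ok_to_replicate(SketchJoinType jointype)
{
    switch (jointype)
    {
        case SK_JOIN_INNER:
            return 1;
        case SK_JOIN_RIGHT:
        case SK_JOIN_RIGHT_ANTI:
        case SK_JOIN_RIGHT_SEMI:
            return 0;           /* join emits inner (build-side) rows */
        default:
            return -1;
    }
}

/*
 * Copies of one matched inner row that reach the result. With a replicated
 * inner side, every segment whose outer slice has a match emits its own
 * copy; otherwise the row lives on one segment and is emitted at most once.
 */
int copies_emitted(const bool *segment_has_match, int nsegments,
                   bool inner_replicated)
{
    int matching = 0;

    assert(nsegments >= 0);
    for (int i = 0; i < nsegments; i++)
    {
        if (segment_has_match[i])
            matching++;
    }
    if (inner_replicated)
        return matching;
    return matching > 0 ? 1 : 0;
}
```

In this model, an inner row with matches on two of three segments would be emitted twice if the inner side were replicated, and once otherwise.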
@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from a869ab3 to 4328d2f Compare June 4, 2026 02:04
@kongfanshen-0801
Contributor Author

Thanks for the review! I've restructured the branch to preserve the original commit history:

  • The two upstream PostgreSQL commits are now cherry-picked with their original author (Richard Guo), original commit messages, and (cherry picked from commit ...) provenance lines preserved:
    • Support "Right Semi Join" plan shapes (aa86129e1)
    • Fix right-semi-joins in HashJoin rescans (5668a857d)
  • The Cloudberry/Greenplum-specific adaptations (moving JOIN_RIGHT_SEMI to the end of the JoinType enum to keep the GPDB enum values ABI-stable, and handling the new join type in cdbpath_motion_for_join) are isolated in a separate follow-up commit.

The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks!

@my-ship-it
Contributor

Thanks. If you have a chance, could we compare TPC-H Q21 before and after this change?
