Skip to content

feat: support new ComparisonJoinKey field in HashJoin/MergeJoin#845

Open
nielspardon wants to merge 2 commits into
substrait-io:mainfrom
nielspardon:feat/join-comparison-keys
Open

feat: support new ComparisonJoinKey field in HashJoin/MergeJoin#845
nielspardon wants to merge 2 commits into
substrait-io:mainfrom
nielspardon:feat/join-comparison-keys

Conversation

@nielspardon

@nielspardon nielspardon commented Jun 4, 2026

Copy link
Copy Markdown
Member

Summary

The deprecated left_keys/right_keys fields of HashJoinRel and MergeJoinRel are being replaced upstream by repeated ComparisonJoinKey keys = 8. This PR migrates substrait-java to consume both the old and new proto fields, always produce the new keys field, and additionally produce the deprecated fields when doing so is lossless (plain EQ joins). This keeps it working against the current proto, preserves compatibility with consumers that have not yet adopted keys, and is ready once the deprecated fields are removed.

The fields were deprecated in Jan 2024 via substrait-io/substrait#585 and thus deprecated in Substrait since v0.42.0.

Changes

  • Add a ComparisonJoinKey POJO modelling the proto message, including the ComparisonType oneof (SimpleComparison / CustomComparison) and the SimpleComparisonType enum, following the existing WindowBound oneof and Join.JoinType enum conventions.
  • HashJoin / MergeJoin now expose getKeys() as the source of truth. getLeftKeys() / getRightKeys() are kept as @Deprecated derived convenience views so existing consumers keep compiling.
  • Writer (RelProtoConverter): always emits keys, and also emits the deprecated left_keys / right_keys when every key is a plain EQ comparison those fields can represent without loss. Keys using IS_NOT_DISTINCT_FROM, MIGHT_EQUAL, or a custom comparison are intentionally left out of the deprecated fields, so an old consumer cannot silently misinterpret them as equality.
  • Reader (ProtoRelConverter): prefers keys when present, falling back to the deprecated fields (paired with a SIMPLE_COMPARISON_TYPE_EQ comparison) when keys is empty.
  • Update RelCopyOnWriteVisitor and the SubstraitBuilder DSL accordingly (public DSL signatures unchanged).

This dual-write-when-lossless policy mirrors the approach taken for the URI → URN migration, and matches the feedback on the analogous substrait-go#256.

Tests

New HashMergeJoinKeysTest covering:

  • round trip of hash/merge joins,
  • produce-both-for-EQ (an all-EQ join emits keys and matching deprecated fields),
  • produce-only-new-for-lossy (a join with a non-EQ key emits only keys),
  • consume-legacy (old fields → EQ keys),
  • precedence (keys wins when both old and new are set),
  • full-fidelity round trip including a non-EQ simple comparison and a custom comparison function.

🤖 Generated with Claude Code

The deprecated `left_keys`/`right_keys` fields of `HashJoinRel` and
`MergeJoinRel` are being replaced upstream by `repeated ComparisonJoinKey
keys`. Migrate substrait-java to consume both the old and new proto fields
while only producing the new `keys` field, so it keeps working today and is
ready once the deprecated fields are removed.

- Add a `ComparisonJoinKey` POJO modelling the proto message, including the
  `ComparisonType` oneof (`SimpleComparison`/`CustomComparison`) and the
  `SimpleComparisonType` enum.
- `HashJoin`/`MergeJoin` now expose `getKeys()` as the source of truth;
  `getLeftKeys()`/`getRightKeys()` are kept as `@Deprecated` derived views.
- Writer emits only `keys`; reader prefers `keys` and falls back to the
  deprecated fields (paired with an EQ comparison) when `keys` is empty.
- Update `RelCopyOnWriteVisitor` and the `SubstraitBuilder` DSL accordingly.
- Add round-trip, legacy-consumption, precedence and full-fidelity tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Writing only the new `keys` field breaks consumers that have not yet
adopted it. Follow the same lossless dual-write policy used elsewhere:
keep `keys` as the canonical representation, but additionally populate
the deprecated `left_keys`/`right_keys` when every key is a plain `EQ`
comparison those fields can represent without loss.

Keys using `IS_NOT_DISTINCT_FROM`, `MIGHT_EQUAL`, or custom comparisons
are intentionally left out of the deprecated fields, since an old
consumer would silently misinterpret them as equality.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nielspardon

nielspardon commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

Updated the writer to follow the lossless dual-write policy raised in substrait-go#256, which applies equally here.

Previously the writer emitted only the new keys field, which would break consumers that haven't yet adopted it. Now:

  • keys is always written (canonical representation).
  • The deprecated left_keys/right_keys are also written, but only when every key is a plain EQ comparison those fields can represent without loss.
  • Keys using IS_NOT_DISTINCT_FROM, MIGHT_EQUAL, or a custom comparison are intentionally left out of the deprecated fields, so an old consumer can't silently misinterpret them as equality.

The reader side already preferred keys and fell back to the deprecated fields (as EQ), so no change was needed there. Tests updated accordingly: a new case asserts EQ joins emit both representations, and another asserts lossy comparisons emit only keys.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant