[Improve][Transform] Replace Transform to support array-based replace_fields#10712
[Improve][Transform] Replace Transform to support array-based replace_fields#10712MyeoungDev wants to merge 11 commits intoapache:devfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the SeaTunnel v2 Replace transform to support replacing multiple fields in a single transform via the new replace_fields configuration (list/array), while keeping backward compatibility with the legacy replace_field key. It also adds unit/E2E coverage and updates the EN/ZH documentation accordingly.
Changes:
- Replace transform config migrated from single
replace_fieldto list-basedreplace_fields(with fallback toreplace_field). - Replace transform implementation updated to apply replacement across multiple configured columns (including regex support).
- Added/updated unit tests, E2E job configs, integration tests, and documentation to reflect the new config.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransform.java | Implements multi-field replacement and regex pattern pre-compilation. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformConfig.java | Introduces replace_fields as a list option with fallback to replace_field. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformFactory.java | Updates factory identifier and option rule to surface the new option. |
| seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceMultiCatalogTransform.java | Aligns plugin name usage with shared constant. |
| seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/replace/ReplaceTransformTest.java | Adds unit tests covering list-based replacement, regex modes, and error cases. |
| seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/exception/TransformErrorTest.java | Updates error-path test to use the new option key. |
| seatunnel-e2e/.../resources/replace/*.conf | Adds E2E configs for backward compatibility and multi-field/multi-table scenarios. |
| seatunnel-e2e/.../TestReplaceIT.java | Expands IT coverage and updates job resource paths. |
| docs/en/transforms/replace.md | Documents replace_fields and notes legacy replace_field compatibility. |
| docs/zh/transforms/replace.md | Same as EN doc updates for Chinese documentation. |
Comments suppressed due to low confidence (1)
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformFactory.java:49
- The conditional rule currently makes
replace_firstrequired whenis_regex=true, but the docs and code treat it as optional (defaulting to false). Either makereplace_firstoptional with a default of false, or drop the conditional requirement so configs withis_regex=truedon’t fail validation unnecessarily.
.optional(ReplaceTransformConfig.KEY_IS_REGEX)
.conditional(
ReplaceTransformConfig.KEY_IS_REGEX,
true,
ReplaceTransformConfig.KEY_REPLACE_FIRST)
.optional(TransformCommonOptions.MULTI_TABLES)
.optional(TransformCommonOptions.TABLE_MATCH_REGEX)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
DanielLeens
left a comment
There was a problem hiding this comment.
Hi @MyeoungDev,
I reviewed the current HEAD locally. The overall direction makes sense, and the main execution path is wired correctly for the new multi-field mode:
ReplaceTransformFactory -> ReplaceMultiCatalogTransform -> AbstractMultiCatalogTransform -> ReplaceTransform -> transformRow().
So replace_fields = ["name", "title"] will hit the normal runtime path for both the single-table and multi-table cases.
I still have one blocking issue before merge:
replace_fields = []is currently accepted and turns the transform into a silent no-op.
ReplaceTransformConfig.KEY_REPLACE_FIELDSis marked as required, butReplaceTransformonly doesaddAll(...)and then buildsreplaceFieldIndexesfrom that list (ReplaceTransform.java:50-79). If the resolved list is empty,transformRow()just copies the row and returns it unchanged (ReplaceTransform.java:97-106).
The old single-field contract failed fast when the target field was invalid; the new list-based API should also fail fast when the resolved field list is empty, otherwise a misconfigured job will look "successful" while doing nothing.
One more follow-up I strongly recommend:
- The legacy
replace_fieldmulti-table compatibility path is not really asserted by the current E2E coverage.
testReplaceMultiTable()only checks the job exit code (TestReplaceIT.java:37-43), andreplace_transform_multi_table.confonly assertsNOT_NULLfor the transformed tables (replace_transform_multi_table.conf:117-138). That means the test can stay green even if the old-key multi-table path stops replacing values.
Please tighten that assertion so the backward-compatibility claim is actually protected.
Because of item 1, I am still at request-changes for now.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the update. I walked the new Replace execution path locally and found one blocking regression.
ReplaceTransform.initializeFieldIndexes() now resolves target indexes from inputCatalogTable.getTableSchema().getColumns() and transformRow() uses those indexes directly against SeaTunnelRow. That is different from the previous implementation, which resolved the target field through toPhysicalRowDataType().indexOf(...).
Because SeaTunnelRow is aligned with the physical row layout, a schema that contains non-physical columns can now hit the wrong slot or fail at runtime. I reproduced this locally with a schema like [MetadataColumn, name, title]: targeting title throws ArrayIndexOutOfBoundsException from ReplaceTransform.java:100.
Could you switch the index resolution back to the physical row layout (for example, derive indexes from toPhysicalRowDataType() or filter Column::isPhysical() before building the index map) and add a regression test for a schema that includes a non-physical column?
Non-blocking note: transformRow() now copies the whole row on every record, which is also a hot-path allocation regression compared with the previous reuse path.
|
Hi @DanielLeens Here's a summary of what was changed
|
DanielLeens
left a comment
There was a problem hiding this comment.
I re-checked the latest HEAD locally.
The blocking issues from my previous reviews look addressed now:
ReplaceTransformFactory.optionRule()no longer forces the top-level fields, so thetable_transformpath is not rejected during factory validation.replace_firstis now truly optional again.ReplaceTransformfails fast forreplace_fields = []instead of silently becoming a no-op.- field indexes are resolved from
toPhysicalRowDataType()again, so the runtime path is back in sync with the physicalSeaTunnelRowlayout.
I still recommend tightening the legacy multi-table E2E assertions in a follow-up, because the current coverage mostly proves job success / NOT_NULL rather than the exact replacement result. But that is non-blocking from my side.
Thanks for turning the feedback around quickly.
DanielLeens
left a comment
There was a problem hiding this comment.
I re-checked the current HEAD after the latest merge commit.
For the Replace files I previously reviewed, I do not see a new functional delta after the revision I already approved:
- the
ReplaceTransform/ factory / tests I checked are unchanged from the approved revision - the latest HEAD does not reintroduce the earlier config-contract / physical-row-index blockers from my side
So there is no new blocker from me on the current Replace implementation.
…and replace_fields
…improve/replace-transform
|
Hi @davidzollo Following your comment, I added validation for the case where both
I added a validation using Could you please review? |
DanielLeens
left a comment
There was a problem hiding this comment.
I pulled the current HEAD locally and re-checked the only new delta after my last approval: the transform now explicitly rejects configs that set both replace_field and replace_fields, and the added test covers that conflict path.
I do not see a new blocker from this follow-up. The remaining Build check is still queued on GitHub, but from the latest code delta my previous approval still stands.
davidzollo
left a comment
There was a problem hiding this comment.
+1 if CI passes.
Thanks for your contribution.
Thanks for the review! |
|
Thanks for the follow-up. I re-checked the current HEAD locally. There is no new functional delta after the revision I previously approved, and the latest author reply does not introduce a new technical issue. From my side the previous approval still stands. The remaining blocker is the current Build failure on GitHub. |
DanielLeens
left a comment
There was a problem hiding this comment.
I re-pulled the current HEAD locally after the latest update.
Incremental check result:
- the newest commit is a merge from
upstream/dev - I compared the PR-specific diff before and after that merge (
c3fa3be6vs current HEAD), and the 13 PR-owned files are unchanged - the runtime path I reviewed previously is still the same:
ReplaceTransformFactory
-> ReplaceMultiCatalogTransform
-> ReplaceTransform.initializeFieldIndexes()
-> ReplaceTransform.transformRow()
So my previous code-level conclusion still stands:
replace_fieldsmulti-column support is still wired correctly- the
replace_fieldfallback compatibility is still preserved - I do not see a new blocker in the current code delta
The only remaining item on this revision is the GitHub Build check, which is still pending.
Purpose of this pull request
Close #10711
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide
incompatible-changes.mdto describe the incompatibility caused by this PR.