Skip to content

[Improve][Transform] Replace Transform to support array-based replace_fields#10712

Open
MyeoungDev wants to merge 11 commits intoapache:devfrom
MyeoungDev:improve/replace-transform
Open

[Improve][Transform] Replace Transform to support array-based replace_fields#10712
MyeoungDev wants to merge 11 commits intoapache:devfrom
MyeoungDev:improve/replace-transform

Conversation

@MyeoungDev
Copy link
Copy Markdown
Contributor

@MyeoungDev MyeoungDev commented Apr 5, 2026

Purpose of this pull request

Close #10711

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the SeaTunnel v2 Replace transform to support replacing multiple fields in a single transform via the new replace_fields configuration (list/array), while keeping backward compatibility with the legacy replace_field key. It also adds unit/E2E coverage and updates the EN/ZH documentation accordingly.

Changes:

  • Replace transform config migrated from single replace_field to list-based replace_fields (with fallback to replace_field).
  • Replace transform implementation updated to apply replacement across multiple configured columns (including regex support).
  • Added/updated unit tests, E2E job configs, integration tests, and documentation to reflect the new config.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransform.java Implements multi-field replacement and regex pattern pre-compilation.
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformConfig.java Introduces replace_fields as a list option with fallback to replace_field.
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformFactory.java Updates factory identifier and option rule to surface the new option.
seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceMultiCatalogTransform.java Aligns plugin name usage with shared constant.
seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/replace/ReplaceTransformTest.java Adds unit tests covering list-based replacement, regex modes, and error cases.
seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/exception/TransformErrorTest.java Updates error-path test to use the new option key.
seatunnel-e2e/.../resources/replace/*.conf Adds E2E configs for backward compatibility and multi-field/multi-table scenarios.
seatunnel-e2e/.../TestReplaceIT.java Expands IT coverage and updates job resource paths.
docs/en/transforms/replace.md Documents replace_fields and notes legacy replace_field compatibility.
docs/zh/transforms/replace.md Same as EN doc updates for Chinese documentation.
Comments suppressed due to low confidence (1)

seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/replace/ReplaceTransformFactory.java:49

  • The conditional rule currently makes replace_first required when is_regex=true, but the docs and code treat it as optional (defaulting to false). Either make replace_first optional with a default of false, or drop the conditional requirement so configs with is_regex=true don’t fail validation unnecessarily.
                .optional(ReplaceTransformConfig.KEY_IS_REGEX)
                .conditional(
                        ReplaceTransformConfig.KEY_IS_REGEX,
                        true,
                        ReplaceTransformConfig.KEY_REPLACE_FIRST)
                .optional(TransformCommonOptions.MULTI_TABLES)
                .optional(TransformCommonOptions.TABLE_MATCH_REGEX)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

DanielLeens

This comment was marked as resolved.

DanielLeens

This comment was marked as duplicate.

DanielLeens

This comment was marked as outdated.

DanielLeens

This comment was marked as outdated.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MyeoungDev,

I reviewed the current HEAD locally. The overall direction makes sense, and the main execution path is wired correctly for the new multi-field mode:
ReplaceTransformFactory -> ReplaceMultiCatalogTransform -> AbstractMultiCatalogTransform -> ReplaceTransform -> transformRow().
So replace_fields = ["name", "title"] will hit the normal runtime path for both the single-table and multi-table cases.

I still have one blocking issue before merge:

  1. replace_fields = [] is currently accepted and turns the transform into a silent no-op.
    ReplaceTransformConfig.KEY_REPLACE_FIELDS is marked as required, but ReplaceTransform only does addAll(...) and then builds replaceFieldIndexes from that list (ReplaceTransform.java:50-79). If the resolved list is empty, transformRow() just copies the row and returns it unchanged (ReplaceTransform.java:97-106).
    The old single-field contract failed fast when the target field was invalid; the new list-based API should also fail fast when the resolved field list is empty, otherwise a misconfigured job will look "successful" while doing nothing.

One more follow-up I strongly recommend:

  1. The legacy replace_field multi-table compatibility path is not really asserted by the current E2E coverage.
    testReplaceMultiTable() only checks the job exit code (TestReplaceIT.java:37-43), and replace_transform_multi_table.conf only asserts NOT_NULL for the transformed tables (replace_transform_multi_table.conf:117-138). That means the test can stay green even if the old-key multi-table path stops replacing values.
    Please tighten that assertion so the backward-compatibility claim is actually protected.

Because of item 1, I am still at request-changes for now.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I walked the new Replace execution path locally and found one blocking regression.

ReplaceTransform.initializeFieldIndexes() now resolves target indexes from inputCatalogTable.getTableSchema().getColumns() and transformRow() uses those indexes directly against SeaTunnelRow. That is different from the previous implementation, which resolved the target field through toPhysicalRowDataType().indexOf(...).

Because SeaTunnelRow is aligned with the physical row layout, a schema that contains non-physical columns can now hit the wrong slot or fail at runtime. I reproduced this locally with a schema like [MetadataColumn, name, title]: targeting title throws ArrayIndexOutOfBoundsException from ReplaceTransform.java:100.

Could you switch the index resolution back to the physical row layout (for example, derive indexes from toPhysicalRowDataType() or filter Column::isPhysical() before building the index map) and add a regression test for a schema that includes a non-physical column?

Non-blocking note: transformRow() now copies the whole row on every record, which is also a hot-path allocation regression compared with the previous reuse path.

DanielLeens

This comment was marked as duplicate.

@MyeoungDev
Copy link
Copy Markdown
Contributor Author

Hi @DanielLeens
Thanks for the detailed review!

Here's a summary of what was changed

  1. Factory option rule required → optional
    Changed replace_fields, pattern, and replacement from required to optional in optionRule(), and removed the conditional on replace_first.

  2. Empty replace_fields validation
    Added an explicit check that throws TransformException immediately when replace_fields is an empty list

  3. Null safety for required options
    Added null checks via getRequiredOption() for replace_fields, pattern, and replacement to prevent NPE when required options are missing.

  4. Physical row fix
    That's where I missed it. thanks for catching it
    Switched initializeFieldIndexes() to use toPhysicalRowDataType() instead of getColumns()

  5. transformRow in-place mutation
    This was flagged by the Copilot review as well.
    Removed the per-record inputRow.copy() and updated fields in-place directly.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the latest HEAD locally.

The blocking issues from my previous reviews look addressed now:

  • ReplaceTransformFactory.optionRule() no longer forces the top-level fields, so the table_transform path is not rejected during factory validation.
  • replace_first is now truly optional again.
  • ReplaceTransform fails fast for replace_fields = [] instead of silently becoming a no-op.
  • field indexes are resolved from toPhysicalRowDataType() again, so the runtime path is back in sync with the physical SeaTunnelRow layout.

I still recommend tightening the legacy multi-table E2E assertions in a follow-up, because the current coverage mostly proves job success / NOT_NULL rather than the exact replacement result. But that is non-blocking from my side.

Thanks for turning the feedback around quickly.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the current HEAD after the latest merge commit.

For the Replace files I previously reviewed, I do not see a new functional delta after the revision I already approved:

  • the ReplaceTransform / factory / tests I checked are unchanged from the approved revision
  • the latest HEAD does not reintroduce the earlier config-contract / physical-row-index blockers from my side

So there is no new blocker from me on the current Replace implementation.

@MyeoungDev
Copy link
Copy Markdown
Contributor Author

Hi @davidzollo

Following your comment, I added validation for the case where both replace_field and replace_fields are configured.

ReadonlyConfig.getOptional() checks the primary key first and only falls back when it is absent, so replace_fields is used when both are present.

I added a validation using getSourceMap() and a unit test for this case.

Could you please review?

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled the current HEAD locally and re-checked the only new delta after my last approval: the transform now explicitly rejects configs that set both replace_field and replace_fields, and the added test covers that conflict path.

I do not see a new blocker from this follow-up. The remaining Build check is still queued on GitHub, but from the latest code delta my previous approval still stands.

Copy link
Copy Markdown
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 if CI passes.
Thanks for your contribution.

@MyeoungDev
Copy link
Copy Markdown
Contributor Author

+1 if CI passes. Thanks for your contribution.

Thanks for the review!
I'm really happy contribute to SeaTunnel :)

@DanielLeens
Copy link
Copy Markdown

Thanks for the follow-up. I re-checked the current HEAD locally.

There is no new functional delta after the revision I previously approved, and the latest author reply does not introduce a new technical issue. From my side the previous approval still stands.

The remaining blocker is the current Build failure on GitHub.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-pulled the current HEAD locally after the latest update.

Incremental check result:

  • the newest commit is a merge from upstream/dev
  • I compared the PR-specific diff before and after that merge (c3fa3be6 vs current HEAD), and the 13 PR-owned files are unchanged
  • the runtime path I reviewed previously is still the same:
ReplaceTransformFactory
  -> ReplaceMultiCatalogTransform
  -> ReplaceTransform.initializeFieldIndexes()
  -> ReplaceTransform.transformRow()

So my previous code-level conclusion still stands:

  • replace_fields multi-column support is still wired correctly
  • the replace_field fallback compatibility is still preserved
  • I do not see a new blocker in the current code delta

The only remaining item on this revision is the GitHub Build check, which is still pending.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Improve][Transform] Replace Transform to support array-based replace_fields

4 participants