[DataLoader] Support setting JVM args when using JNI #529
Open
robreeves wants to merge 14 commits into linkedin:main from
Conversation
Allow users to pass JVM arguments (e.g. -Xmx512m, -verbose:gc, -D flags) to the JNI JVM used by the HDFS client. Adds planner_jvm_args and worker_jvm_args to DataLoaderContext so planner and worker processes can use different JVM configurations.
Ensures the JVM starts with worker_jvm_args before any UDF registration code can trigger JNI.
Pass -Xmx128m via DataLoaderContext on the first DataLoader and verify at the end that the JVM's MaxHeapSize matches 128 MB.
- Set -Xmx127m via planner_jvm_args on the first DataLoader and verify the planner JVM's MaxHeapSize is capped at 128m.
- Spawn a child process to materialize a split with -Xmx254m via worker_jvm_args and verify the worker JVM gets a larger, distinct heap size.
Extract _assert_jvm_heap helper and run planner/worker heap checks before printing the final success message. Also fix ruff formatting.
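The `_assert_jvm_heap` helper mentioned above could work by inspecting a JVM's effective `MaxHeapSize` flag. A minimal sketch of that kind of check, assuming the helper parses `java -XX:+PrintFlagsFinal -version` output (the PR excerpt does not show its real implementation, and `parse_max_heap` is a hypothetical name):

```python
import re


def parse_max_heap(flags_output: str) -> int:
    """Extract MaxHeapSize (in bytes) from `java -XX:+PrintFlagsFinal -version`
    output. Illustrative sketch only; the helper in the PR may differ.
    """
    m = re.search(r"MaxHeapSize\s*:?=\s*(\d+)", flags_output)
    if m is None:
        raise ValueError("MaxHeapSize not found in flags output")
    return int(m.group(1))


# A line as it appears in real PrintFlagsFinal output for -Xmx128m:
sample = "size_t MaxHeapSize = 134217728 {product} {command line}"
assert parse_max_heap(sample) == 128 * 1024 * 1024
```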
Move planner_jvm_args and worker_jvm_args into a JvmConfig dataclass to reduce clutter on DataLoaderContext for non-JNI use cases. Usage: DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g"))
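Based on the usage shown in that commit message, the `JvmConfig` dataclass could look roughly like the following. This is a hedged sketch: the field defaults and type annotations are assumptions, not taken from the PR diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JvmConfig:
    """JVM arguments for the JNI-backed HDFS client (illustrative sketch).

    planner_args apply to the planner process; worker_args apply to the
    worker processes that materialize splits, so the two can differ.
    """

    planner_args: Optional[str] = None  # e.g. "-Xmx2g -verbose:gc"
    worker_args: Optional[str] = None   # e.g. "-Xmx512m"


# Hypothetical usage mirroring the commit message:
# ctx = DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g"))
```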
Contributor
Pull request overview
Adds a client-facing way to configure JVM arguments used by libhdfs/JNI (via LIBHDFS_OPTS), so DataLoader can control JVM settings like max heap size in both planner and worker processes.
Changes:
- Introduces JvmConfig on DataLoaderContext and applies planner_args early in OpenHouseDataLoader.__init__.
- Propagates worker_args through TableScanContext and applies them when materializing DataLoaderSplits (before potential JNI triggers).
- Adds unit and integration tests validating that LIBHDFS_OPTS is set and that -Xmx is honored by planner/worker JVMs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| integrations/python/dataloader/src/openhouse/dataloader/_jvm.py | Adds apply_libhdfs_opts() helper for merging JVM args into LIBHDFS_OPTS. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader.py | Adds JvmConfig, wires planner args application, and passes worker args into scan context. |
| integrations/python/dataloader/src/openhouse/dataloader/_table_scan_context.py | Extends TableScanContext pickle round-trip to include worker_jvm_args. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py | Applies worker JVM args before ArrowScan and adjusts transform flow to force an early read. |
| integrations/python/dataloader/src/openhouse/dataloader/__init__.py | Exports JvmConfig as part of the public package API. |
| integrations/python/dataloader/tests/test_jvm.py | New unit tests for apply_libhdfs_opts() behavior. |
| integrations/python/dataloader/tests/test_data_loader.py | Adds unit coverage for planner args being applied during loader initialization. |
| integrations/python/dataloader/tests/test_data_loader_split.py | Adds unit coverage for worker args being applied during split iteration. |
| integrations/python/dataloader/tests/integration_tests.py | Adds end-to-end validation that -Xmx is honored for planner and worker JVMs. |
integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py
Skip appending if jvm_args already present in LIBHDFS_OPTS. Use a threading lock to prevent concurrent threads from duplicating args.
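The idempotency fix described in that commit (skip already-present args, guard with a lock) could be sketched as follows. This is an illustrative reconstruction of what `apply_libhdfs_opts()` might do, assuming a module-level lock and a simple substring check; the actual implementation in `_jvm.py` is not shown in this excerpt.

```python
import os
import threading

_lock = threading.Lock()  # guards the read-modify-write of the env var


def apply_libhdfs_opts(jvm_args: str) -> None:
    """Merge jvm_args into LIBHDFS_OPTS exactly once (sketch).

    libhdfs reads LIBHDFS_OPTS when it launches the in-process JVM, so
    this must run before any JNI call. The lock plus the substring check
    keep concurrent callers from appending the same args twice.
    """
    if not jvm_args:
        return
    with _lock:
        current = os.environ.get("LIBHDFS_OPTS", "")
        if jvm_args in current:  # already applied; skip duplicates
            return
        os.environ["LIBHDFS_OPTS"] = f"{current} {jvm_args}".strip()
```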
Contributor
integrations/python/dataloader/src/openhouse/dataloader/_table_scan_context.py
Summary
Data loader uses PyIceberg to load data. If the data is on HDFS, PyIceberg uses libhdfs, which relies on JNI. This change lets the data loader configure the JVM that is created for JNI, which matters for settings like max heap size. For example, if the data loader runs on a machine with a lot of physical memory, the default heap size (half of physical memory) will be much larger than needed.
This also makes sure the data loader is the first thing to use JNI. This is important because JNI creates one JVM per process: if something else uses JNI first, it creates the JVM, that JVM is reused by the data loader, and the params specified in the data loader are ignored. By ensuring the data loader uses JNI first, the JVM is guaranteed to be created with the user-defined params.
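The ordering constraint above can be sketched as follows. This is illustrative only: `configure_jni_jvm` is a hypothetical name, and the real code merges into LIBHDFS_OPTS rather than overwriting it.

```python
import os


def configure_jni_jvm(jvm_args: str) -> None:
    # libhdfs reads LIBHDFS_OPTS only when the in-process JVM is first
    # created, and JNI creates at most one JVM per process. Setting the
    # variable after another library has already triggered JNI has no
    # effect on that process, so this must run before any JNI use.
    os.environ["LIBHDFS_OPTS"] = jvm_args


# Correct order (sketch):
configure_jni_jvm("-Xmx512m")  # 1) set the JVM args first
# 2) only now run code that may call into libhdfs/JNI,
#    e.g. registering UDFs or materializing a split.
```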
Changes
Testing Done
I added new unit tests and an integration test case for both the planner and worker JVM args. From the integration tests:
Additional Information