[DataLoader] Support setting JVM args when using JNI #529
Open
robreeves wants to merge 14 commits into linkedin:main from
Conversation
Allow users to pass JVM arguments (e.g. -Xmx512m, -verbose:gc, -D flags) to the JNI JVM used by the HDFS client. Adds planner_jvm_args and worker_jvm_args to DataLoaderContext so planner and worker processes can use different JVM configurations.
Ensures the JVM starts with worker_jvm_args before any UDF registration code can trigger JNI.
Pass -Xmx128m via DataLoaderContext on the first DataLoader and verify at the end that the JVM's MaxHeapSize matches 128 MB.
- Set -Xmx127m via planner_jvm_args on the first DataLoader and verify the planner JVM's MaxHeapSize is capped at 128m.
- Spawn a child process to materialize a split with -Xmx254m via worker_jvm_args and verify the worker JVM gets a larger, distinct heap size.
Extract _assert_jvm_heap helper and run planner/worker heap checks before printing the final success message. Also fix ruff formatting.
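The `_assert_jvm_heap` helper mentioned above could work by inspecting a JVM's effective `MaxHeapSize` flag. A minimal sketch of that kind of check, assuming the helper parses `java -XX:+PrintFlagsFinal -version` output (the PR excerpt does not show its real implementation, and `parse_max_heap` is a hypothetical name):

```python
import re


def parse_max_heap(flags_output: str) -> int:
    """Extract MaxHeapSize (in bytes) from `java -XX:+PrintFlagsFinal -version`
    output. Illustrative sketch only; the helper in the PR may differ.
    """
    m = re.search(r"MaxHeapSize\s*:?=\s*(\d+)", flags_output)
    if m is None:
        raise ValueError("MaxHeapSize not found in flags output")
    return int(m.group(1))


# A line as it appears in real PrintFlagsFinal output for -Xmx128m:
sample = "size_t MaxHeapSize = 134217728 {product} {command line}"
assert parse_max_heap(sample) == 128 * 1024 * 1024
```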
Move planner_jvm_args and worker_jvm_args into a JvmConfig dataclass to reduce clutter on DataLoaderContext for non-JNI use cases. Usage: DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g"))
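Based on the usage shown in that commit message, the `JvmConfig` dataclass could look roughly like the following. This is a hedged sketch: the field defaults and type annotations are assumptions, not taken from the PR diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JvmConfig:
    """JVM arguments for the JNI-backed HDFS client (illustrative sketch).

    planner_args apply to the planner process; worker_args apply to the
    worker processes that materialize splits, so the two can differ.
    """

    planner_args: Optional[str] = None  # e.g. "-Xmx2g -verbose:gc"
    worker_args: Optional[str] = None   # e.g. "-Xmx512m"


# Hypothetical usage mirroring the commit message:
# ctx = DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g"))
```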
Contributor
Pull request overview
Adds a client-facing way to configure JVM arguments used by libhdfs/JNI (via LIBHDFS_OPTS), so DataLoader can control JVM settings like max heap size in both planner and worker processes.
Changes:
- Introduces JvmConfig on DataLoaderContext and applies planner_args early in OpenHouseDataLoader.__init__.
- Propagates worker_args through TableScanContext and applies them when materializing DataLoaderSplits (before potential JNI triggers).
- Adds unit and integration tests validating that LIBHDFS_OPTS is set and that -Xmx is honored by planner/worker JVMs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| integrations/python/dataloader/src/openhouse/dataloader/_jvm.py | Adds apply_libhdfs_opts() helper for merging JVM args into LIBHDFS_OPTS. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader.py | Adds JvmConfig, wires planner args application, and passes worker args into scan context. |
| integrations/python/dataloader/src/openhouse/dataloader/_table_scan_context.py | Extends TableScanContext pickle round-trip to include worker_jvm_args. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py | Applies worker JVM args before ArrowScan and adjusts transform flow to force an early read. |
| integrations/python/dataloader/src/openhouse/dataloader/__init__.py | Exports JvmConfig as part of the public package API. |
| integrations/python/dataloader/tests/test_jvm.py | New unit tests for apply_libhdfs_opts() behavior. |
| integrations/python/dataloader/tests/test_data_loader.py | Adds unit coverage for planner args being applied during loader initialization. |
| integrations/python/dataloader/tests/test_data_loader_split.py | Adds unit coverage for worker args being applied during split iteration. |
| integrations/python/dataloader/tests/integration_tests.py | Adds end-to-end validation that -Xmx is honored for planner and worker JVMs. |
integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py
Skip appending if jvm_args already present in LIBHDFS_OPTS. Use a threading lock to prevent concurrent threads from duplicating args.
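The idempotency fix described in that commit (skip already-present args, guard with a lock) could be sketched as follows. This is an illustrative reconstruction of what `apply_libhdfs_opts()` might do, assuming a module-level lock and a simple substring check; the actual implementation in `_jvm.py` is not shown in this excerpt.

```python
import os
import threading

_lock = threading.Lock()  # guards the read-modify-write of the env var


def apply_libhdfs_opts(jvm_args: str) -> None:
    """Merge jvm_args into LIBHDFS_OPTS exactly once (sketch).

    libhdfs reads LIBHDFS_OPTS when it launches the in-process JVM, so
    this must run before any JNI call. The lock plus the substring check
    keep concurrent callers from appending the same args twice.
    """
    if not jvm_args:
        return
    with _lock:
        current = os.environ.get("LIBHDFS_OPTS", "")
        if jvm_args in current:  # already applied; skip duplicates
            return
        os.environ["LIBHDFS_OPTS"] = f"{current} {jvm_args}".strip()
```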
Contributor
integrations/python/dataloader/src/openhouse/dataloader/_table_scan_context.py
Summary
Data loader uses PyIceberg to load data. If the data is on HDFS, PyIceberg uses libhdfs, which relies on JNI. This change lets the data loader configure the JVM that is created for JNI, which matters for settings like max heap size. For example, if the data loader runs on a machine with a lot of physical memory, the default heap size (half of physical memory) will be much larger than needed.
This also makes sure the data loader is the first thing to use JNI. This is important because JNI creates one JVM per process: if something else uses JNI first, it creates the JVM, that JVM is reused by the data loader, and the params specified in the data loader are ignored. By ensuring the data loader uses JNI first, the JVM is guaranteed to be created with the user-defined params.
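The ordering constraint above can be sketched as follows. This is illustrative only: `configure_jni_jvm` is a hypothetical name, and the real code merges into LIBHDFS_OPTS rather than overwriting it.

```python
import os


def configure_jni_jvm(jvm_args: str) -> None:
    # libhdfs reads LIBHDFS_OPTS only when the in-process JVM is first
    # created, and JNI creates at most one JVM per process. Setting the
    # variable after another library has already triggered JNI has no
    # effect on that process, so this must run before any JNI use.
    os.environ["LIBHDFS_OPTS"] = jvm_args


# Correct order (sketch):
configure_jni_jvm("-Xmx512m")  # 1) set the JVM args first
# 2) only now run code that may call into libhdfs/JNI,
#    e.g. registering UDFs or materializing a split.
```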
Changes
Testing Done
I added new unit tests and an integration test case for both the planner and worker JVM args. From the integration tests:
Additional Information