[DataLoader] Support setting JVM args when using JNI #529

Open
robreeves wants to merge 14 commits into linkedin:main from robreeves:jvm_args

Conversation

@robreeves (Collaborator) commented Apr 3, 2026

Summary

Data loader uses PyIceberg to load data. If the data is on HDFS, PyIceberg reads it through libhdfs, which uses the JNI. This PR extends data loader so that it can configure the JVM created through the JNI. This matters for settings like the max heap size. For example, if data loader runs on a machine with a lot of physical memory, the default max heap size will be much bigger than needed (HotSpot defaults to a quarter of physical memory).

This also makes sure that data loader is the first thing to use the JNI. This is important because the JNI creates one JVM per process. If something else uses the JNI first, it creates the JVM, which is then reused for data loader, and the params specified in data loader are ignored. Making sure data loader goes first guarantees the JVM is created with the user-defined params.
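
The ordering concern can be illustrated with a toy model. The sketch below is not libhdfs or JNI code: `ProcessJvm` is a hypothetical stand-in that mimics a single JVM per process, whose options are fixed by whichever caller starts it first.

```python
import threading
from typing import Optional, Tuple


class ProcessJvm:
    """Toy stand-in for the per-process JVM (hypothetical, not a real API)."""

    _instance: Optional["ProcessJvm"] = None
    _lock = threading.Lock()

    def __init__(self, options: Tuple[str, ...]) -> None:
        self.options = options

    @classmethod
    def get_or_create(cls, options: Tuple[str, ...]) -> "ProcessJvm":
        # Only the first caller's options are used; later callers
        # receive the already-running instance unchanged.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(options)
            return cls._instance


# Some other library triggers the JNI first, with no custom options.
first = ProcessJvm.get_or_create(())
# Data loader asks later with -Xmx512m, but gets the existing JVM back.
second = ProcessJvm.get_or_create(("-Xmx512m",))
print(second.options)  # ()
```

In this model the `-Xmx512m` request is silently dropped, which is the failure mode the PR avoids by applying its args before anything else can start the JVM.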

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually tested on local docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

I added new unit tests and an integration test case for both the planner and worker jvm args. From the integration tests:

PASS: planner_jvm_args honored by JVM (MaxHeapSize=134217728)
PASS: worker_jvm_args honored by child JVM (MaxHeapSize=266338304)
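
The reported `MaxHeapSize` values are byte counts: 134217728 is 128 MiB and 266338304 is 254 MiB. As a minimal sketch of that arithmetic, the hypothetical helper `xmx_to_bytes` below (not part of this PR) converts an `-Xmx` flag into bytes. It assumes the JVM uses the requested size as-is; in practice the JVM may round the value up to its heap alignment, so an exact match is not guaranteed for arbitrary sizes.

```python
import re

# Binary multipliers for the -Xmx suffixes k, m and g.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def xmx_to_bytes(flag: str) -> int:
    """Convert a flag such as '-Xmx128m' to a byte count (illustrative helper)."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError(f"not an -Xmx flag: {flag!r}")
    number, unit = match.groups()
    # No suffix means the number is already in bytes.
    return int(number) * _UNITS.get(unit.lower(), 1)


print(xmx_to_bytes("-Xmx128m"))  # 134217728
print(xmx_to_bytes("-Xmx254m"))  # 266338304
```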

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

Allow users to pass JVM arguments (e.g. -Xmx512m, -verbose:gc, -D flags)
to the JNI JVM used by the HDFS client. Adds planner_jvm_args and
worker_jvm_args to DataLoaderContext so planner and worker processes can
use different JVM configurations.

@robreeves robreeves changed the title from "Add jvm_args support to DataLoaderContext" to "[DataLoader] Support setting JVM args when using JNI" Apr 3, 2026
robreeves added 11 commits April 6, 2026 18:36

Ensures the JVM starts with worker_jvm_args before any UDF
registration code can trigger JNI.

Pass -Xmx128m via DataLoaderContext on the first DataLoader and
verify at the end that the JVM's MaxHeapSize matches 128 MB.

- Set -Xmx127m via planner_jvm_args on the first DataLoader and verify
  the planner JVM's MaxHeapSize is capped at 128m.
- Spawn a child process to materialize a split with -Xmx254m via
  worker_jvm_args and verify the worker JVM gets a larger, distinct
  heap size.

Extract _assert_jvm_heap helper and run planner/worker heap checks
before printing the final success message. Also fix ruff formatting.

Move planner_jvm_args and worker_jvm_args into a JvmConfig dataclass
to reduce clutter on DataLoaderContext for non-JNI use cases.

Usage: DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g"))
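
To make the shape of that usage line concrete, here is a minimal sketch of how the two classes might fit together. These are illustrative stand-ins only: the real `JvmConfig` and `DataLoaderContext` in `openhouse.dataloader` may have more fields, different defaults, or different types. Only the names `jvm`, `planner_args` and `worker_args` come from this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class JvmConfig:
    """Stand-in for the PR's JvmConfig (assumed shape)."""

    planner_args: Optional[str] = None  # JVM args for the planner process
    worker_args: Optional[str] = None   # JVM args for worker processes


@dataclass(frozen=True)
class DataLoaderContext:
    """Stand-in context; non-JNI users never need to touch `jvm`."""

    jvm: JvmConfig = field(default_factory=JvmConfig)


# Planner and workers can get different heap caps.
ctx = DataLoaderContext(jvm=JvmConfig(planner_args="-Xmx2g", worker_args="-Xmx512m"))
print(ctx.jvm.planner_args)  # -Xmx2g
print(ctx.jvm.worker_args)   # -Xmx512m
```

Grouping the two settings in one nested object keeps the context's top-level surface small, which is the motivation the commit message gives.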
@robreeves robreeves marked this pull request as ready for review April 7, 2026 17:47
@robreeves robreeves requested a review from Copilot April 7, 2026 17:48
Copilot AI (Contributor) left a comment

Pull request overview

Adds a client-facing way to configure JVM arguments used by libhdfs/JNI (via LIBHDFS_OPTS), so DataLoader can control JVM settings like max heap size in both planner and worker processes.

Changes:

  • Introduces JvmConfig on DataLoaderContext and applies planner_args early in OpenHouseDataLoader.__init__.
  • Propagates worker_args through TableScanContext and applies them when materializing DataLoaderSplits (before potential JNI triggers).
  • Adds unit and integration tests validating that LIBHDFS_OPTS is set and that -Xmx is honored by planner/worker JVMs.
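
The worker-side propagation can be sketched as a pickle round trip. `ScanContextSketch` below is a hypothetical stand-in for `TableScanContext`, whose actual fields and serialization hooks are not shown in this conversation. The sketch pickles a plain dict of the state so that it does not depend on how the class itself is imported.

```python
import pickle
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ScanContextSketch:
    """Hypothetical stand-in for TableScanContext."""

    table_name: str
    worker_jvm_args: Optional[str] = None

    def to_payload(self) -> bytes:
        # Serialize the state, including the worker JVM args, for shipping
        # from the planner process to a worker process.
        return pickle.dumps(asdict(self))

    @classmethod
    def from_payload(cls, payload: bytes) -> "ScanContextSketch":
        return cls(**pickle.loads(payload))


planner_side = ScanContextSketch("db.events", worker_jvm_args="-Xmx254m")
worker_side = ScanContextSketch.from_payload(planner_side.to_payload())
print(worker_side.worker_jvm_args)  # -Xmx254m
```

If the args were left out of the serialized state, the worker would start its JVM with defaults, so including the field in the round trip is what makes the worker setting effective.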

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| integrations/python/dataloader/src/openhouse/dataloader/_jvm.py | Adds apply_libhdfs_opts() helper for merging JVM args into LIBHDFS_OPTS. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader.py | Adds JvmConfig, wires planner args application, and passes worker args into scan context. |
| integrations/python/dataloader/src/openhouse/dataloader/_table_scan_context.py | Extends TableScanContext pickle round-trip to include worker_jvm_args. |
| integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py | Applies worker JVM args before ArrowScan and adjusts transform flow to force an early read. |
| integrations/python/dataloader/src/openhouse/dataloader/__init__.py | Exports JvmConfig as part of the public package API. |
| integrations/python/dataloader/tests/test_jvm.py | New unit tests for apply_libhdfs_opts() behavior. |
| integrations/python/dataloader/tests/test_data_loader.py | Adds unit coverage for planner args being applied during loader initialization. |
| integrations/python/dataloader/tests/test_data_loader_split.py | Adds unit coverage for worker args being applied during split iteration. |
| integrations/python/dataloader/tests/integration_tests.py | Adds end-to-end validation that -Xmx is honored for planner and worker JVMs. |

Skip appending if jvm_args already present in LIBHDFS_OPTS. Use a
threading lock to prevent concurrent threads from duplicating args.
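
A minimal sketch of that idempotent, lock-guarded merge follows. It is not the PR's `apply_libhdfs_opts()` implementation, which is not shown here. The function name, the `environ` parameter, and the substring-based duplicate check are all assumptions made for illustration.

```python
import os
import threading
from typing import MutableMapping, Optional

_LIBHDFS_OPTS_LOCK = threading.Lock()


def apply_libhdfs_opts_sketch(
    jvm_args: Optional[str],
    environ: MutableMapping[str, str] = os.environ,
) -> None:
    """Append jvm_args to LIBHDFS_OPTS unless already present (illustrative)."""
    if not jvm_args:
        return
    # The read-check-write sequence must be atomic, otherwise two threads
    # could both see the args as missing and append them twice.
    with _LIBHDFS_OPTS_LOCK:
        current = environ.get("LIBHDFS_OPTS", "")
        if jvm_args in current:  # assumption: simple substring check
            return
        environ["LIBHDFS_OPTS"] = f"{current} {jvm_args}".strip()


# Use a plain dict here so the example does not touch the real environment.
env = {"LIBHDFS_OPTS": "-verbose:gc"}
apply_libhdfs_opts_sketch("-Xmx512m", env)
apply_libhdfs_opts_sketch("-Xmx512m", env)  # second call is a no-op
print(env["LIBHDFS_OPTS"])  # -verbose:gc -Xmx512m
```

A substring check is the simplest reading of "already present"; a stricter variant could compare individual whitespace-separated tokens instead.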
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.



@robreeves robreeves requested review from cbb330 and sumedhsakdeo April 7, 2026 19:08