Skip to content

Make prompt dataloader num_workers/persistent_workers configurable#1746

Open
jamesbraza wants to merge 2 commits into
NovaSky-AI:mainfrom
EdisonScientific:dataloader-worker-config
Open

Make prompt dataloader num_workers/persistent_workers configurable#1746
jamesbraza wants to merge 2 commits into
NovaSky-AI:mainfrom
EdisonScientific:dataloader-worker-config

Conversation

@jamesbraza
Copy link
Copy Markdown
Contributor

Adds a data.dataloader config section exposing num_workers and persistent_workers(previously hardcoded) with backward-compatible defaults.

Closes #1744.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new DataLoaderConfig to configure num_workers and persistent_workers for training and evaluation dataloaders, replacing previous hardcoded values. It also adds validation rules and unit tests to verify the configuration resolution. The reviewer recommends disabling persistent_workers for evaluation dataloaders (is_train=False) to prevent unnecessary resource retention and potential delays in releasing worker processes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread skyrl/train/utils/trainer_utils.py
jamesbraza and others added 2 commits June 3, 2026 15:52
The prompt DataLoader's num_workers and persistent_workers were hardcoded
(num_workers auto-derived from whether the inference HTTP endpoint is on), so a
run could not opt into in-process loading. Add a data.dataloader config section
exposing both, documented via field metadata.

num_workers defaults to None and is auto-derived in SkyRLTrainConfig.__post_init__
(0 with the HTTP endpoint, else 8), preserving prior behavior; persistent_workers
defaults to False. __post_init__ validates the resolved values: persistent_workers
requires num_workers > 0, and num_workers must be None or >= 0.

Setting num_workers=0 loads the in-memory dataset in-process, avoiding the worker
respawn (and venv re-import it triggers) at every epoch boundary that can hang
silently when a shared filesystem such as NFS is overwhelmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
build_dataloader now reads the resolved data.dataloader.num_workers and
persistent_workers instead of hardcoding the worker count and deriving the
multiprocessing context from whether the inference HTTP endpoint is on. The spawn
start method is derived from the effective worker count (spawn only when
num_workers > 0), which preserves the prior default and makes num_workers=0 valid.

Resolves NovaSky-AI#1744

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jamesbraza jamesbraza force-pushed the dataloader-worker-config branch from d4ae38f to 164987d Compare June 3, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Configuration for SkyRL to store training data in-memory, instead of reloading every epoch

1 participant