Make prompt dataloader num_workers/persistent_workers configurable#1746
Make prompt dataloader num_workers/persistent_workers configurable#1746jamesbraza wants to merge 2 commits into
num_workers/persistent_workers configurable#1746Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new DataLoaderConfig to configure num_workers and persistent_workers for training and evaluation dataloaders, replacing previous hardcoded values. It also adds validation rules and unit tests to verify the configuration resolution. The reviewer recommends disabling persistent_workers for evaluation dataloaders (is_train=False) to prevent unnecessary resource retention and potential delays in releasing worker processes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
The prompt DataLoader's num_workers and persistent_workers were hardcoded (num_workers auto-derived from whether the inference HTTP endpoint is on), so a run could not opt into in-process loading. Add a data.dataloader config section exposing both, documented via field metadata. num_workers defaults to None and is auto-derived in SkyRLTrainConfig.__post_init__ (0 with the HTTP endpoint, else 8), preserving prior behavior; persistent_workers defaults to False. __post_init__ validates the resolved values: persistent_workers requires num_workers > 0, and num_workers must be None or >= 0. Setting num_workers=0 loads the in-memory dataset in-process, avoiding the worker respawn (and venv re-import it triggers) at every epoch boundary that can hang silently when a shared filesystem such as NFS is overwhelmed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
build_dataloader now reads the resolved data.dataloader.num_workers and persistent_workers instead of hardcoding the worker count and deriving the multiprocessing context from whether the inference HTTP endpoint is on. The spawn start method is derived from the effective worker count (spawn only when num_workers > 0), which preserves the prior default and makes num_workers=0 valid. Resolves NovaSky-AI#1744 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d4ae38f to
164987d
Compare
Adds a
data.dataloaderconfig section exposingnum_workersandpersistent_workers(previously hardcoded) with backward-compatible defaults.Closes #1744.