Describe the bug
When an OSMO task produces large output (e.g. multi-GB model checkpoints), the osmo-ctrl sidecar is terminated before the output upload to S3 completes. The workflow is marked FAILED despite the user container exiting successfully with code 0. Output data is lost.
This affects any workflow where the output upload takes longer than ~30 seconds (the Kubernetes default terminationGracePeriodSeconds).
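At an illustrative sustained 100 MB/s, the 9.2 GB checkpoint from the environment below already needs over 90 seconds to upload, roughly three times the default grace period.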
To reproduce
- Submit a workflow task that writes >1 GB to /osmo/data/output/
- Wait for the user container to exit successfully
- Observe osmo-ctrl logs — "Upload Start" appears but "Upload Complete" never does
- Workflow status: FAILED
- S3 output path: empty or partial
osmo-ctrl logs
Exec finished
Upload Start
Uploading task:s3://bucket/workflow-id/task-name
Uploading task-name
No "Upload Complete" — the pod was terminated mid-upload.
Environment
- OSMO version: v6.2-rc6
- Kubernetes: 1.31 (EKS)
- Output size: 9.2 GB (GR00T fine-tuning checkpoints)
Root cause analysis
We traced through the OSMO source code and identified two contributing issues:
1. terminationGracePeriodSeconds is never set on task pods
src/service/core/config/objects.py defines DEFAULT_POD_TEMPLATES with osmo-ctrl placed under spec.containers[]. Neither this default template nor the pod template merge logic in src/utils/connectors/postgres.py:calculate_pod_template() sets terminationGracePeriodSeconds; a code search for the field across the repository returns zero results. Task pods therefore run with the Kubernetes default of 30 seconds.
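This is straightforward to confirm against a live cluster. A minimal sketch using the Kubernetes Python client (pod name and namespace are placeholders, not actual OSMO values):

# Sketch: confirm the effective grace period on a running OSMO task pod.
from kubernetes import client, config

config.load_kube_config()
pod = client.CoreV1Api().read_namespaced_pod(
    name='example-task-pod', namespace='osmo')  # placeholder identifiers
# Prints 30 when nothing in the pod template overrides the API server default.
print(pod.spec.termination_grace_period_seconds)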
2. osmo-ctrl's SIGTERM handler exits immediately without draining uploads
In src/runtime/cmd/ctrl/ctrl.go, the signal handler is:
signal.Notify(sigintCatch, os.Interrupt, syscall.SIGINT, syscall.SIGTERM)
go func() {
    <-sigintCatch
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()
On SIGTERM, osmo-ctrl unmounts FUSE and calls os.Exit(1) immediately. There is no logic to wait for an in-progress uploadOutputs() call (which spawns an osmo data upload subprocess via src/runtime/pkg/data/data.go) to complete. The upload subprocess is killed along with the osmo-ctrl process.
Combined effect: Even if terminationGracePeriodSeconds were increased, osmo-ctrl would still abort the upload the moment SIGTERM arrives. Both issues must be fixed together.
Proposed fix
Convert osmo-ctrl to a Kubernetes native sidecar container (KEP-753) and fix the SIGTERM handler to drain in-progress uploads. Three changes:
1. Convert osmo-ctrl from regular container to native sidecar
In src/utils/job/task.py, move control_container_spec from spec.containers[] to spec.initContainers[] with restartPolicy: 'Always':
# Current:
'containers': [user_container_spec, control_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
],

# Proposed:
'containers': [user_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
    {**control_container_spec, 'restartPolicy': 'Always'},
],
'terminationGracePeriodSeconds': 600,
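With restartPolicy: 'Always' on an init container, the kubelet treats osmo-ctrl as a native sidecar: it starts before the user container, keeps running alongside it, and during pod termination is only sent SIGTERM after the regular containers have stopped. Combined with the larger grace period, this gives osmo-ctrl a window to finish the upload after the user container exits.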
Important: The default_ctrl entry in DEFAULT_POD_TEMPLATES in src/service/core/config/objects.py must NOT also place osmo-ctrl in initContainers — this causes a Duplicate value: "osmo-ctrl" K8s 422 error because the merge_lists_on_name template merge adds a second entry. The template should only set terminationGracePeriodSeconds:
'default_ctrl': {
    'spec': {
        'terminationGracePeriodSeconds': 600,
    }
},
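As a guard against reintroducing the duplicate, a small pre-submission check could reject a merged spec whose container names collide across containers[] and initContainers[], which is the exact condition Kubernetes rejects with the 422. This helper is hypothetical, not existing OSMO code:

# Hypothetical guard: Kubernetes requires container names to be unique across
# containers[] and initContainers[], so a collision here is what surfaces as
# the 'Duplicate value: "osmo-ctrl"' admission error.
def assert_unique_container_names(pod_spec: dict) -> None:
    names = [c['name'] for c in
             pod_spec.get('initContainers', []) + pod_spec.get('containers', [])]
    duplicates = {n for n in names if names.count(n) > 1}
    if duplicates:
        raise ValueError(f'duplicate container names in merged pod spec: {duplicates}')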
2. Fix SIGTERM handler to drain in-progress uploads
In src/runtime/cmd/ctrl/ctrl.go:
// Package-level variables
var uploading atomic.Bool
var uploadDone = make(chan struct{})
// Wrap uploadOutputs call
uploading.Store(true)
uploadOutputs(...)
uploading.Store(false)
close(uploadDone)
// Replace SIGTERM handler
go func() {
    <-sigintCatch
    if uploading.Load() {
        log.Println("SIGTERM received during upload, waiting for completion...")
        select {
        case <-uploadDone:
            log.Println("Upload completed after SIGTERM, exiting gracefully")
        case <-time.After(9 * time.Minute):
            log.Println("Upload drain timeout exceeded, forcing exit")
        }
    }
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()
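The 9-minute drain window (540 s) sits below the proposed terminationGracePeriodSeconds of 600, leaving roughly a minute for cleanupMounts and process shutdown before the kubelet escalates to SIGKILL.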
3. Update backend_listener.py for native sidecar status
In src/operator/backend_listener.py, check_running_pod_containers() only checks pod.status.container_statuses. With osmo-ctrl as a native sidecar, its status moves to pod.status.init_container_statuses:
# Current:
container_statuses = pod.status.container_statuses if pod.status.container_statuses else []

# Proposed:
container_statuses = list(itertools.chain(
    pod.status.init_container_statuses or [],
    pod.status.container_statuses or []))
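For context, a minimal sketch of how the combined list could be consumed; the real check_running_pod_containers logic may differ, and the status handling below is illustrative only:

import itertools

def running_containers(pod):
    # With osmo-ctrl as a native sidecar its status appears under
    # init_container_statuses, so both lists must be scanned together.
    statuses = itertools.chain(
        pod.status.init_container_statuses or [],
        pod.status.container_statuses or [])
    return [s.name for s in statuses if s.state and s.state.running]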
Compatibility
- Native sidecar containers (KEP-753): alpha in K8s 1.28, beta + enabled by default since K8s 1.29, GA in K8s 1.33
- The existing merge_lists_on_name in src/lib/utils/common.py handles the template merge correctly
- get_container_failure_message() in backend_listener.py already checks both init_container_statuses and container_statuses — only check_running_pod_containers() needs updating
- Users on K8s < 1.29 would need to enable the SidecarContainers feature gate explicitly
Impact
- Training/simulation succeeds but output is silently lost
- Workflow status shows FAILED despite successful execution
- Multi-stage pipelines break (downstream stages have no input data)
- Users have no indication the failure is caused by upload termination — they retry the full compute job, wasting GPU hours