bug: osmo-ctrl sidecar killed before task output upload completes — data loss for large outputs #764

@KeitaW

Describe the bug

When an OSMO task produces large output (e.g. multi-GB model checkpoints), the osmo-ctrl sidecar is terminated before the output upload to S3 completes. The workflow is marked FAILED despite the user container exiting successfully with code 0. Output data is lost.

This affects any workflow where the output upload takes longer than ~30 seconds (the Kubernetes default terminationGracePeriodSeconds).

To reproduce

  1. Submit a workflow task that writes >1 GB to /osmo/data/output/
  2. Wait for the user container to exit successfully
  3. Observe osmo-ctrl logs — "Upload Start" appears but "Upload Complete" never does
  4. Workflow status: FAILED
  5. S3 output path: empty or partial

osmo-ctrl logs

Exec finished
Upload Start
Uploading task:s3://bucket/workflow-id/task-name
Uploading task-name

No "Upload Complete" — the pod was terminated mid-upload.

Environment

  • OSMO version: v6.2-rc6
  • Kubernetes: 1.31 (EKS)
  • Output size: 9.2 GB (GR00T fine-tuning checkpoints)
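To put the 30-second window in perspective, this back-of-the-envelope sketch computes the sustained throughput needed to upload the 9.2 GB output within a given grace period. It assumes decimal gigabytes (1 GB = 1000 MB) and ignores upload startup overhead. The function name is illustrative and not part of OSMO.

```python
def required_throughput_mb_per_s(output_gb: float, window_s: float) -> float:
    """Sustained MB/s needed to move output_gb within window_s seconds."""
    return output_gb * 1000 / window_s


# Default Kubernetes grace period (30 s) vs. the 600 s proposed in this issue.
at_default = required_throughput_mb_per_s(9.2, 30)
at_proposed = required_throughput_mb_per_s(9.2, 600)

print(f"30 s window:  {at_default:.0f} MB/s")
print(f"600 s window: {at_proposed:.1f} MB/s")
```

The default window demands roughly 307 MB/s sustained, while a 600-second window needs only about 15 MB/s.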

Root cause analysis

We traced through the OSMO source code and identified two contributing issues:

1. terminationGracePeriodSeconds is never set on task pods

src/service/core/config/objects.py defines DEFAULT_POD_TEMPLATES with osmo-ctrl placed under spec.containers[]. Neither this default template nor the pod template merge logic in src/utils/connectors/postgres.py:calculate_pod_template() sets terminationGracePeriodSeconds. A code search for that field across the repository returns zero results, so task pods run with the Kubernetes default of 30 seconds.
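A minimal sketch of how the effective grace period could be read from a rendered pod manifest, assuming the manifest is available as a plain dict. The helper name and the sample manifest (including the "user" container name) are hypothetical. Only the field name and the 30-second default come from Kubernetes.

```python
K8S_DEFAULT_GRACE_PERIOD_S = 30  # Kubernetes default when the field is omitted


def effective_grace_period(pod_manifest: dict) -> int:
    """Return the grace period Kubernetes would apply to this pod."""
    value = pod_manifest.get("spec", {}).get("terminationGracePeriodSeconds")
    return K8S_DEFAULT_GRACE_PERIOD_S if value is None else value


# Hypothetical manifest shaped like today's task pods: field not set.
current_pod = {"spec": {"containers": [{"name": "user"}, {"name": "osmo-ctrl"}]}}
# Same pod with the value proposed in this issue.
patched_pod = {"spec": {**current_pod["spec"], "terminationGracePeriodSeconds": 600}}

print(effective_grace_period(current_pod))
print(effective_grace_period(patched_pod))
```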

2. osmo-ctrl's SIGTERM handler exits immediately without draining uploads

In src/runtime/cmd/ctrl/ctrl.go, the signal handler is:

signal.Notify(sigintCatch, os.Interrupt, syscall.SIGINT, syscall.SIGTERM)
go func() {
    <-sigintCatch
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()

On SIGTERM, osmo-ctrl unmounts FUSE and calls os.Exit(1) immediately. There is no logic to wait for an in-progress uploadOutputs() call (which spawns an osmo data upload subprocess via src/runtime/pkg/data/data.go) to complete. The upload subprocess is killed along with the osmo-ctrl process.

Combined effect: Even if terminationGracePeriodSeconds were increased, osmo-ctrl would still abort the upload the moment SIGTERM arrives. Both issues must be fixed together.

Proposed fix

Convert osmo-ctrl to a Kubernetes native sidecar container (KEP-753) and fix the SIGTERM handler to drain in-progress uploads. Three changes:

1. Convert osmo-ctrl from regular container to native sidecar

In src/utils/job/task.py, move control_container_spec from spec.containers[] to spec.initContainers[] with restartPolicy: 'Always':

# Current:
'containers': [user_container_spec, control_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
],

# Proposed:
'containers': [user_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
    {**control_container_spec, 'restartPolicy': 'Always'},
],
'terminationGracePeriodSeconds': 600,

Important: The default_ctrl entry in DEFAULT_POD_TEMPLATES in src/service/core/config/objects.py must NOT also place osmo-ctrl in initContainers. Doing so produces a Kubernetes 422 error, Duplicate value: "osmo-ctrl", because the merge_lists_on_name template merge adds a second entry. The template should only set terminationGracePeriodSeconds:

'default_ctrl': {
    'spec': {
        'terminationGracePeriodSeconds': 600,
    }
},
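Kubernetes requires container names to be unique within a pod, so a duplicate like the one described above could be caught before submission. The following is a sketch of such a pre-submit check, not existing OSMO code. The helper name and the "osmo-init" and "user" container names are hypothetical.

```python
from collections import Counter


def duplicate_container_names(pod_spec: dict) -> list:
    """Names that appear more than once across initContainers and containers."""
    names = [
        container["name"]
        for key in ("initContainers", "containers")
        for container in pod_spec.get(key, [])
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical merged spec where both task.py and the template added osmo-ctrl.
bad_spec = {
    "initContainers": [
        {"name": "osmo-init"},
        {"name": "osmo-ctrl", "restartPolicy": "Always"},
        {"name": "osmo-ctrl", "restartPolicy": "Always"},
    ],
    "containers": [{"name": "user"}],
}
good_spec = {
    "initContainers": bad_spec["initContainers"][:2],
    "containers": bad_spec["containers"],
}

print(duplicate_container_names(bad_spec))
print(duplicate_container_names(good_spec))
```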

2. Fix SIGTERM handler to drain in-progress uploads

In src/runtime/cmd/ctrl/ctrl.go:

// Package-level variables
var uploading atomic.Bool
var uploadDone = make(chan struct{})

// Wrap uploadOutputs call
uploading.Store(true)
uploadOutputs(...)
uploading.Store(false)
close(uploadDone)

// Replace SIGTERM handler
go func() {
    <-sigintCatch
    if uploading.Load() {
        log.Println("SIGTERM received during upload, waiting for completion...")
        select {
        case <-uploadDone:
            log.Println("Upload completed after SIGTERM, exiting gracefully")
        case <-time.After(9 * time.Minute):
            log.Println("Upload drain timeout exceeded, forcing exit")
        }
    }
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()
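The drain logic in the proposed handler has three outcomes: no upload in flight, upload finished within the timeout, or timeout exceeded. The sketch below is a Python analogue of that decision using threading.Event, intended only to make the outcomes easy to test in isolation. It is not the Go implementation, and the function name, return labels, and simulated upload are assumptions.

```python
import threading
import time


def drain(upload_done: threading.Event, uploading: bool, timeout_s: float) -> str:
    """Decide what a termination handler does, mirroring the proposal's branches."""
    if not uploading:
        return "no-upload"  # nothing to wait for, exit right away
    if upload_done.wait(timeout=timeout_s):
        return "drained"  # upload signalled completion in time
    return "timed-out"  # give up and force exit


# Simulated upload that finishes shortly after the signal arrives.
finished = threading.Event()


def fake_upload() -> None:
    time.sleep(0.05)
    finished.set()


worker = threading.Thread(target=fake_upload)
worker.start()
drained_result = drain(finished, uploading=True, timeout_s=2.0)
worker.join()

# An upload that never completes within the (deliberately tiny) timeout.
stuck_result = drain(threading.Event(), uploading=True, timeout_s=0.01)
idle_result = drain(threading.Event(), uploading=False, timeout_s=0.01)

print(drained_result, stuck_result, idle_result)
```

In the proposal the drain timeout (9 minutes) is shorter than the 600-second grace period, which leaves time for cleanupMounts before Kubernetes sends SIGKILL.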

3. Update backend_listener.py for native sidecar status

In src/operator/backend_listener.py, check_running_pod_containers() only checks pod.status.container_statuses. With osmo-ctrl as a native sidecar, its status moves to pod.status.init_container_statuses:

# Current:
container_statuses = pod.status.container_statuses if pod.status.container_statuses else []

# Proposed (needs `import itertools` at the top of the module):
container_statuses = list(itertools.chain(
    pod.status.init_container_statuses or [],
    pod.status.container_statuses or []))
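The proposed expression can be exercised without a cluster. This sketch uses SimpleNamespace objects as stand-ins for the Kubernetes client's pod model. The wrapper function and the "user" container name are hypothetical, while the two status attribute names are the ones referenced above.

```python
import itertools
from types import SimpleNamespace


def all_container_statuses(pod) -> list:
    """Init-container statuses (native sidecars included) followed by regular ones."""
    return list(itertools.chain(
        pod.status.init_container_statuses or [],
        pod.status.container_statuses or []))


# Stand-in pod: osmo-ctrl reported as a native sidecar, user container as regular.
pod = SimpleNamespace(status=SimpleNamespace(
    init_container_statuses=[SimpleNamespace(name="osmo-ctrl")],
    container_statuses=[SimpleNamespace(name="user")],
))
# Stand-in pod with no statuses reported yet (both attributes are None).
pending_pod = SimpleNamespace(status=SimpleNamespace(
    init_container_statuses=None,
    container_statuses=None,
))

names = [status.name for status in all_container_statuses(pod)]
print(names)
print(all_container_statuses(pending_pod))
```

The `or []` guards matter because either attribute may be None before the kubelet reports statuses.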

Compatibility

  • Native sidecar containers (KEP-753): alpha in K8s 1.28, beta + enabled by default since K8s 1.29, GA in K8s 1.33
  • The existing merge_lists_on_name in src/lib/utils/common.py handles the template merge correctly
  • get_container_failure_message() in backend_listener.py already checks both init_container_statuses and container_statuses — only check_running_pod_containers() needs updating
  • Users on K8s < 1.29 would need to enable the SidecarContainers feature gate explicitly
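The version thresholds above can be expressed as a small lookup, for example to warn at deploy time. This is a sketch: the function name and return labels are made up for illustration, and it assumes the cluster has not overridden the SidecarContainers feature gate. Versions before 1.28 are treated as unsupported because the feature did not exist there.

```python
def native_sidecar_support(major: int, minor: int) -> str:
    """Classify native sidecar (KEP-753) availability for a Kubernetes version."""
    version = (major, minor)
    if version >= (1, 33):
        return "ga"
    if version >= (1, 29):
        return "beta-default-on"
    if version >= (1, 28):
        return "alpha-feature-gate"  # SidecarContainers must be enabled explicitly
    return "unsupported"


# The environment in this report runs Kubernetes 1.31 on EKS.
reported_env = native_sidecar_support(1, 31)
print(reported_env)
```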

Impact

  • Training/simulation succeeds but output is silently lost
  • Workflow status shows FAILED despite successful execution
  • Multi-stage pipelines break (downstream stages have no input data)
  • Users have no indication the failure is caused by upload termination — they retry the full compute job, wasting GPU hours

Code of Conduct

  • I agree to follow NVIDIA OSMO's Code of Conduct
  • I have searched the open issues and have found no duplicates for this report
