bug: osmo-ctrl sidecar killed before task output upload completes — data loss for large outputs #764

### Describe the bug

When an OSMO task produces large output (e.g. multi-GB model checkpoints), the `osmo-ctrl` sidecar is terminated before the output upload to S3 completes. The workflow is marked **FAILED** despite the user container exiting successfully with code 0. Output data is lost.

This affects any workflow where the output upload takes longer than ~30 seconds (the Kubernetes default `terminationGracePeriodSeconds`).

### To reproduce

1. Submit a workflow task that writes >1 GB to `/osmo/data/output/`
2. Wait for the user container to exit successfully
3. Observe osmo-ctrl logs — "Upload Start" appears but "Upload Complete" never does
4. Workflow status: FAILED
5. S3 output path: empty or partial

### osmo-ctrl logs

```
Exec finished
Upload Start
Uploading task:s3://bucket/workflow-id/task-name
Uploading task-name
```

No "Upload Complete" — the pod was terminated mid-upload.

### Environment

- OSMO version: v6.2-rc6
- Kubernetes: 1.31 (EKS)
- Output size: 9.2 GB (GR00T fine-tuning checkpoints)

### Root cause analysis

We traced through the OSMO source code and identified two contributing issues:

**1. `terminationGracePeriodSeconds` is never set on task pods**

[`src/service/core/config/objects.py`](https://github.com/NVIDIA/OSMO/blob/main/src/service/core/config/objects.py) defines `DEFAULT_POD_TEMPLATES` with `osmo-ctrl` placed under `spec.containers[]`. Neither this default template nor the pod template merge logic in [`src/utils/connectors/postgres.py:calculate_pod_template()`](https://github.com/NVIDIA/OSMO/blob/main/src/utils/connectors/postgres.py) sets `terminationGracePeriodSeconds`. A [code search](https://github.com/NVIDIA/OSMO/search?q=terminationGracePeriodSeconds) across the repository returns zero results. Task pods run with the Kubernetes default of 30 seconds.

**2. osmo-ctrl's SIGTERM handler exits immediately without draining uploads**

In [`src/runtime/cmd/ctrl/ctrl.go`](https://github.com/NVIDIA/OSMO/blob/main/src/runtime/cmd/ctrl/ctrl.go), the signal handler is:

```go
signal.Notify(sigintCatch, os.Interrupt, syscall.SIGINT, syscall.SIGTERM)
go func() {
    <-sigintCatch
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()
```

On SIGTERM, osmo-ctrl unmounts FUSE and calls `os.Exit(1)` immediately. There is no logic to wait for an in-progress `uploadOutputs()` call (which spawns an `osmo data upload` subprocess via [`src/runtime/pkg/data/data.go`](https://github.com/NVIDIA/OSMO/blob/main/src/runtime/pkg/data/data.go)) to complete. The upload subprocess is killed along with the osmo-ctrl process.

**Combined effect:** Even if `terminationGracePeriodSeconds` were increased, osmo-ctrl would still abort the upload the moment SIGTERM arrives. Both issues must be fixed together.

### Proposed fix

Convert osmo-ctrl to a [Kubernetes native sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) (KEP-753) and fix the SIGTERM handler to drain in-progress uploads. Three changes:

#### 1. Convert osmo-ctrl from regular container to native sidecar

In [`src/utils/job/task.py`](https://github.com/NVIDIA/OSMO/blob/main/src/utils/job/task.py), move `control_container_spec` from `spec.containers[]` to `spec.initContainers[]` with `restartPolicy: 'Always'`:

```python
# Current:
'containers': [user_container_spec, control_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
],

# Proposed:
'containers': [user_container_spec],
'initContainers': [
    k8s_factory.create_init_container(...),
    {**control_container_spec, 'restartPolicy': 'Always'},
],
'terminationGracePeriodSeconds': 600,
```
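
The same transformation can be sketched as a standalone function over a plain pod-spec dict. This is a minimal illustration, not OSMO's code: `move_to_native_sidecar` and the `user` and `init` container names are hypothetical, and the real change would live wherever `task.py` assembles the spec.

```python
import copy


def move_to_native_sidecar(pod_spec, name, grace_seconds=600):
    """Return a copy of pod_spec with container `name` run as a native sidecar.

    Hypothetical helper for illustration; not part of OSMO.
    """
    spec = copy.deepcopy(pod_spec)
    containers = spec.get('containers', [])
    sidecar = next((c for c in containers if c.get('name') == name), None)
    if sidecar is None:
        raise ValueError(f'container {name!r} not found in spec.containers')

    # Regular containers keep everything except the sidecar.
    spec['containers'] = [c for c in containers if c.get('name') != name]
    # An init container with restartPolicy "Always" is a native sidecar.
    # Appending keeps it after the existing init containers.
    spec.setdefault('initContainers', []).append(
        {**sidecar, 'restartPolicy': 'Always'})
    spec['terminationGracePeriodSeconds'] = grace_seconds
    return spec


# Example with placeholder container specs ("user" and "init" are assumed names).
before = {
    'containers': [{'name': 'user'}, {'name': 'osmo-ctrl'}],
    'initContainers': [{'name': 'init'}],
}
after = move_to_native_sidecar(before, 'osmo-ctrl')
print([c['name'] for c in after['containers']])
print([c['name'] for c in after['initContainers']])
```

The input dict is deep-copied, so the original spec is left untouched and the two layouts can be compared side by side.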

**Important:** The `default_ctrl` entry in `DEFAULT_POD_TEMPLATES` in [`src/service/core/config/objects.py`](https://github.com/NVIDIA/OSMO/blob/main/src/service/core/config/objects.py) must NOT also place osmo-ctrl in `initContainers` — this causes a `Duplicate value: "osmo-ctrl"` K8s 422 error because the `merge_lists_on_name` template merge adds a second entry. The template should only set `terminationGracePeriodSeconds`:

```python
'default_ctrl': {
    'spec': {
        'terminationGracePeriodSeconds': 600,
    }
},
```
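
Kubernetes requires container names to be unique across `containers` and `initContainers` within a pod, which is why the duplicate entry is rejected. A pre-submission check along these lines could surface the problem before the API server does. `duplicate_container_names` is a hypothetical helper, not an OSMO function, and the `user` container name is assumed.

```python
from collections import Counter


def duplicate_container_names(pod_spec):
    """Return the sorted container names that occur more than once in the pod."""
    names = [
        c['name']
        for key in ('initContainers', 'containers')
        for c in pod_spec.get(key) or []
    ]
    return sorted(n for n, count in Counter(names).items() if count > 1)


# Placeholder merged spec reproducing the duplicate osmo-ctrl entry.
merged = {
    'initContainers': [
        {'name': 'osmo-ctrl', 'restartPolicy': 'Always'},
        {'name': 'osmo-ctrl', 'restartPolicy': 'Always'},
    ],
    'containers': [{'name': 'user'}],
}
print(duplicate_container_names(merged))
```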

#### 2. Fix SIGTERM handler to drain in-progress uploads

In [`src/runtime/cmd/ctrl/ctrl.go`](https://github.com/NVIDIA/OSMO/blob/main/src/runtime/cmd/ctrl/ctrl.go):

```go
// Requires "log", "os", "sync/atomic", and "time" (atomic.Bool needs Go 1.19+).

// Package-level state shared between the upload path and the signal handler
var uploading atomic.Bool
var uploadDone = make(chan struct{})

// Wrap the uploadOutputs call
uploading.Store(true)
uploadOutputs(...)
uploading.Store(false)
close(uploadDone) // unblocks a signal handler that is waiting on the upload

// Replace the SIGTERM handler
go func() {
    <-sigintCatch
    if uploading.Load() {
        log.Println("SIGTERM received during upload, waiting for completion...")
        select {
        case <-uploadDone:
            log.Println("Upload completed after SIGTERM, exiting gracefully")
        // 9 minutes leaves a 1-minute margin inside the 600-second grace period
        case <-time.After(9 * time.Minute):
            log.Println("Upload drain timeout exceeded, forcing exit")
        }
    }
    cleanupMounts(cmdArgs.DownloadType)
    os.Exit(1)
}()
```
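
Whether a 9-minute drain window suffices depends on sustained upload throughput. As a rough check, assuming decimal units (1 GB = 1000 MB) and a steady transfer rate, the sketch below computes the minimum rate needed for the 9.2 GB output reported under Environment. `min_throughput_mb_s` is a hypothetical helper used only for this estimate.

```python
def min_throughput_mb_s(size_gb, window_seconds):
    """Minimum sustained rate (MB/s) needed to upload size_gb within window_seconds."""
    if window_seconds <= 0:
        raise ValueError('window_seconds must be positive')
    return size_gb * 1000 / window_seconds


output_gb = 9.2  # output size from the Environment section

# Default 30-second grace period versus the proposed 9-minute drain timeout.
for label, window in (('30 s default', 30), ('9 min drain', 9 * 60)):
    print(f'{label}: {min_throughput_mb_s(output_gb, window):.0f} MB/s')
```

Under these assumptions the upload would need roughly 307 MB/s to finish within the default grace period, but only about 17 MB/s within the 9-minute window. Larger outputs or slower links may still call for a configurable timeout.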

#### 3. Update backend_listener.py for native sidecar status

In [`src/operator/backend_listener.py`](https://github.com/NVIDIA/OSMO/blob/main/src/operator/backend_listener.py), `check_running_pod_containers()` only checks `pod.status.container_statuses`. With osmo-ctrl as a native sidecar, its status moves to `pod.status.init_container_statuses`:

```python
# Current:
container_statuses = pod.status.container_statuses if pod.status.container_statuses else []

# Proposed (needs `import itertools` at module level):
container_statuses = list(itertools.chain(
    pod.status.init_container_statuses or [],
    pod.status.container_statuses or []))
```
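
To see the effect in isolation, the sketch below applies the same chaining to stand-in objects. `SimpleNamespace` replaces the real pod objects, `all_container_statuses` is a hypothetical name, and the `user` container name is assumed. The `or []` guards matter because either attribute may be `None`.

```python
import itertools
from types import SimpleNamespace


def all_container_statuses(pod):
    """Init-container statuses (including native sidecars), then regular ones."""
    return list(itertools.chain(
        pod.status.init_container_statuses or [],
        pod.status.container_statuses or []))


pod = SimpleNamespace(status=SimpleNamespace(
    init_container_statuses=[SimpleNamespace(name='osmo-ctrl')],
    container_statuses=[SimpleNamespace(name='user')],
))
print([s.name for s in all_container_statuses(pod)])

# A pod whose status lists are still unset yields an empty list, not an error.
pending = SimpleNamespace(status=SimpleNamespace(
    init_container_statuses=None, container_statuses=None))
print(all_container_statuses(pending))
```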

### Compatibility

- Native sidecar containers (KEP-753): alpha in K8s 1.28, **beta + enabled by default since K8s 1.29**, GA in K8s 1.33
- The existing `merge_lists_on_name` in [`src/lib/utils/common.py`](https://github.com/NVIDIA/OSMO/blob/main/src/lib/utils/common.py) handles the template merge correctly
- `get_container_failure_message()` in `backend_listener.py` already checks both `init_container_statuses` and `container_statuses` — only `check_running_pod_containers()` needs updating
- Users on K8s 1.28 would need to enable the `SidecarContainers` feature gate explicitly; versions before 1.28 do not support native sidecars
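
The version thresholds above can be captured in a small lookup, for example to decide at deploy time whether the native sidecar layout is usable. `native_sidecar_support` and its return labels are hypothetical; the 1.28 lower bound reflects that the `SidecarContainers` feature gate was introduced as alpha in that release.

```python
def native_sidecar_support(minor, major=1):
    """Classify native sidecar (KEP-753) availability for a Kubernetes version."""
    version = (major, minor)
    if version < (1, 28):
        return 'unsupported'            # feature gate does not exist yet
    if version < (1, 29):
        return 'feature-gate-required'  # alpha, off by default
    if version < (1, 33):
        return 'beta-default-on'
    return 'ga'


for minor in (27, 28, 31, 33):
    print(f'1.{minor}: {native_sidecar_support(minor)}')
```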

### Impact

- Training/simulation succeeds but output is silently lost
- Workflow status shows FAILED despite successful execution
- Multi-stage pipelines break (downstream stages have no input data)
- Users have no indication the failure is caused by upload termination — they retry the full compute job, wasting GPU hours

### Code of Conduct

- [x] I agree to follow NVIDIA OSMO's Code of Conduct
- [x] I have searched the [open issues](https://github.com/NVIDIA/OSMO/issues) and have found no duplicates for this report
