Describe the bug
I'm running LiveKit Egress through OpenVidu V3 and observed a case where the egress node exceeded the configured CPU admission limit and started killing egress sessions due to sustained high CPU.
The egress config has:
cpu_cost:
  max_cpu_utilization: 0.8
However, during a burst of room-composite egress starts on a single egress node, the service accepted new requests and shortly after reported CPU usage around 95–99% with requests: 8.
The service then started logging:
high cpu usage
killing egress due to sustained high cpu
This affected egresses that were already active on the same node.
Based on reviewing pkg/stats/monitor.go, I believe this is not necessarily a concurrent double-accept race, since AcceptRequest is serialized with m.mu. Instead, the issue appears to be caused by CPU admission/accounting edge cases:
- pendingCPU is reset from a time.AfterFunc callback without holding m.mu, while other goroutines read it under m.mu (see the sketch after this list).
- When m.requests.Load() == 0, the first request appears to be checked against total CPU, not total * MaxCpuUtilization.
- cpuHoldDuration = 15s may release the CPU reservation before the real CPU cost of Chrome/GStreamer/RTMP encoding is visible in the process stats. During warmup, lastCPU can still be too low, allowing additional starts that later push the node above max_cpu_utilization.
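To make the locking concern concrete, here is a minimal standalone Go sketch of the pattern described in the first item; it is a simplified illustration of my reading of the code, not the actual monitor.go implementation, and the type, field, and method names are placeholders:

package main

import (
	"sync"
	"time"
)

// monitor is a simplified stand-in for the CPU bookkeeping in question;
// all names here are illustrative placeholders, not real monitor.go fields.
type monitor struct {
	mu         sync.Mutex
	pendingCPU float64
}

// acceptRequest reserves CPU for a new egress under the lock, then releases
// the reservation after the hold period.
func (m *monitor) acceptRequest(cost float64, hold time.Duration) {
	m.mu.Lock()
	m.pendingCPU += cost
	m.mu.Unlock()

	// Suspected pattern: the timer callback clears the reservation without
	// re-acquiring m.mu, so it can race with readers that do hold the lock.
	time.AfterFunc(hold, func() {
		m.pendingCPU -= cost // unsynchronized write
	})
}

// acceptRequestGuarded shows the same reset performed under the lock.
func (m *monitor) acceptRequestGuarded(cost float64, hold time.Duration) {
	m.mu.Lock()
	m.pendingCPU += cost
	m.mu.Unlock()

	time.AfterFunc(hold, func() {
		m.mu.Lock()
		m.pendingCPU -= cost
		m.mu.Unlock()
	})
}

func main() {
	m := &monitor{}
	m.acceptRequestGuarded(3.0, 15*time.Second)
}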
In the observed incident, two new room-composite egresses were accepted almost at the same time. They became active within a few seconds, and shortly after the node reported sustained high CPU and started killing egresses.
Egress Version
Running through OpenVidu V3.
OpenVidu component observed in logs:
@openvidu-meet/backend@3.6.0
The exact LiveKit Egress image/version still needs to be confirmed from the deployment.
Log references point to LiveKit Egress code paths such as:
github.com/livekit/egress/pkg/...
and gst-go dependency:
github.com/livekit/gst-go@v0.0.0-20250701011214-e7f61abd14cb
Egress Request
Two room-composite stream egresses were started almost at the same time on the same egress node.
Request 1
{
  "RoomComposite": {
    "room_name": "room-1",
    "layout": "PRESENTERS_ONLY",
    "custom_base_url": "https://<redacted-custom-layout-url>",
    "Output": {
      "Stream": {
        "protocol": 1,
        "urls": ["rtmp://<redacted>/live/<redacted>"]
      }
    },
    "Options": {
      "Preset": 2
    }
  }
}
Request 2
{
  "RoomComposite": {
    "room_name": "room-2",
    "layout": "PRESENTERS_ONLY",
    "custom_base_url": "https://<redacted-custom-layout-url>",
    "Output": {
      "Stream": {
        "protocol": 1,
        "urls": ["rtmp://<redacted>/live/<redacted>"]
      }
    },
    "Options": {
      "Preset": 2
    }
  }
}
Additional context
This was a single-node egress instance used by OpenVidu V3.
At the time of the incident, the node already had multiple active egress requests. The two new egresses were accepted at 2026-04-29T01:04:33Z and became active at 2026-04-29T01:04:36Z.
Immediately after that, the egress service reported:
cpu ~= 0.95–0.99
requests: 8
Then the egress service started killing egresses due to sustained high CPU.
Important timing:
2026-04-29T01:04:33Z - two new egress requests accepted
2026-04-29T01:04:36Z - both became EGRESS_ACTIVE
2026-04-29T01:04:38Z - high CPU warning started, cpu ~0.963, requests: 8
2026-04-29T01:04:47Z - killing egress due to sustained high CPU
2026-04-29T01:05:04Z - killing egress due to sustained high CPU
2026-04-29T01:05:16Z - killing egress due to sustained high CPU
The affected egresses eventually ended with Closed Remotely, but that was a delayed final state. The operational issue started around 2026-04-29T01:04:38Z, when CPU admission/capacity was exceeded.
The relevant event on the egress node was sustained high CPU followed by LiveKit Egress killing sessions.
Logs
Relevant sanitized logs:
2026-04-29T01:04:33.140Z INFO egress server/server_rpc.go:64 request received
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2"}
2026-04-29T01:04:33.142Z INFO egress server/server_rpc.go:74 request validated
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2","requestType":"room_composite","sourceType":"EGRESS_SOURCE_TYPE_WEB","outputType":"stream","room":"room-1"}
2026-04-29T01:04:33.458Z INFO egress server/server_rpc.go:64 request received
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg"}
2026-04-29T01:04:33.458Z INFO egress server/server_rpc.go:74 request validated
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg","requestType":"room_composite","sourceType":"EGRESS_SOURCE_TYPE_WEB","outputType":"stream","room":"room-2"}
2026-04-29T01:04:36.029Z INFO egress pipeline/watch.go:259 pipeline playing
{"nodeID":"NE_Ld6efiWfXu4w","handlerID":"EGH_NkbCG4pnGkWS","egressID":"EG_2X8PoxUbrAE2","timeToPlaying":"1.009169877s"}
2026-04-29T01:04:36.031Z INFO egress info/io.go:273 egress_active
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2","requestType":"room_composite","outputType":"stream"}
2026-04-29T01:04:36.218Z INFO egress pipeline/watch.go:259 pipeline playing
{"nodeID":"NE_Ld6efiWfXu4w","handlerID":"EGH_5SHEzXBcsn6u","egressID":"EG_NXLkCKcy6YWg","timeToPlaying":"1.01701603s"}
2026-04-29T01:04:36.222Z INFO egress info/io.go:273 egress_active
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg","requestType":"room_composite","outputType":"stream"}
2026-04-29T01:04:38.145Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9633333333333333,"requests":8}
2026-04-29T01:04:39.132Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.960669456080048,"requests":8}
2026-04-29T01:04:40.143Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9591666666666663,"requests":8}
2026-04-29T01:04:41.135Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9684123025558992,"requests":8}
2026-04-29T01:04:42.155Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9691923397255453,"requests":8}
2026-04-29T01:04:47.153Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9766860949296086,"requests":8}
2026-04-29T01:04:47.158Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9766860949296086}
2026-04-29T01:05:04.162Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9502487562145029,"requests":8}
2026-04-29T01:05:04.163Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9502487562145029}
2026-04-29T01:05:16.139Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9592346089835386,"requests":8}
2026-04-29T01:05:16.139Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9592346089835386}
Possible cause / analysis
After reviewing pkg/stats/monitor.go, I suspect the admission logic can underestimate real CPU during egress warmup.
The two new requests were accepted before their real CPU cost was visible in process stats. A few seconds later, when Chrome/GStreamer/RTMP encoding started, CPU jumped above the configured max_cpu_utilization.
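To make the admission arithmetic concrete, here is a hedged sketch of the kind of headroom check described above; the node size (8 logical cores), the assumed ~3-core reservation per room-composite, and the function names are illustrative assumptions, not values taken from monitor.go:

package main

import "fmt"

// canAccept sketches a headroom check: new work should only be admitted while
// measured CPU plus still-reserved (pending) CPU stays under total * maxUtilization.
func canAccept(totalCPUs, maxUtilization, measuredCPU, pendingCPU, estimatedCost float64) bool {
	available := totalCPUs*maxUtilization - measuredCPU - pendingCPU
	return estimatedCost <= available
}

func main() {
	const (
		totalCPUs      = 8.0 // hypothetical node size
		maxUtilization = 0.8 // from cpu_cost.max_cpu_utilization
		estimatedCost  = 3.0 // assumed per-room-composite reservation
	)

	// Warmup: measured CPU of already-running egresses is still low and their
	// pending reservations have been released, so the check sees headroom.
	fmt.Println(canAccept(totalCPUs, maxUtilization, 2.5, 0.0, estimatedCost)) // true: accepted

	// Steady state: the same workload measured at full cost no longer fits.
	fmt.Println(canAccept(totalCPUs, maxUtilization, 6.5, 0.0, estimatedCost)) // false

	// The first-request path described above would effectively compare against
	// totalCPUs (8.0) instead of totalCPUs*maxUtilization (6.4), a looser check.
}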
Potential contributing factors:
- pendingCPU is reset by the cpuHoldDuration timer, but if real CPU samples are still low or missing during warmup, the monitor may admit more work than the node can actually handle.
- The first request path appears to use full CPU capacity instead of total * MaxCpuUtilization.
- pendingCPU reset appears to happen from a timer goroutine without holding m.mu, while other paths read it under lock.
I can provide a PR for items (1) and (2), but would like maintainer guidance on how to handle (3), especially whether pendingCPU should remain as a floor until real CPU samples are available.
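As a starting point for that discussion, one possible shape of the "pendingCPU as a floor" behavior, written as a standalone sketch (the names and numbers are assumptions for illustration, not existing monitor.go code):

package main

import "fmt"

// effectiveCPU treats the reserved (pending) CPU as a floor: until real process
// samples catch up to the reservation, admission keeps charging the reserved
// amount instead of the lower measured value.
func effectiveCPU(measuredCPU, reservedCPU float64) float64 {
	if measuredCPU < reservedCPU {
		return reservedCPU
	}
	return measuredCPU
}

func main() {
	// Warmup: measured CPU (1.2 cores) is still below the 3.0-core reservation,
	// so the node keeps accounting for the full reservation and does not over-admit.
	fmt.Println(effectiveCPU(1.2, 3.0)) // 3

	// Steady state: measured CPU has caught up, so the real sample wins.
	fmt.Println(effectiveCPU(3.4, 3.0)) // 3.4
}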