Describe the bug
I'm running LiveKit Egress through OpenVidu V3 and observed a case where the egress node exceeded the configured CPU admission limit and started killing egress sessions due to sustained high CPU.
The egress config has:
cpu_cost:
  max_cpu_utilization: 0.8
However, during a burst of room-composite egress starts on a single egress node, the service accepted new requests and shortly after reported CPU usage around 95–99% with requests: 8.
The service then started logging:
high cpu usage
killing egress due to sustained high cpu
This affected egresses that were already active on the same node.
Based on reviewing pkg/stats/monitor.go, I believe this is not necessarily a concurrent double-accept race, since AcceptRequest is serialized with m.mu. Instead, the issue appears to be caused by CPU admission/accounting edge cases:
- pendingCPU is reset from a time.AfterFunc callback without holding m.mu, while other goroutines read it under m.mu (see the sketch after this list).
- When m.requests.Load() == 0, the first request appears to be checked against total CPU, not total * MaxCpuUtilization.
- cpuHoldDuration = 15s may release the CPU reservation before the real CPU cost of Chrome/GStreamer/RTMP encoding is visible in the process stats. During warmup, lastCPU can still be too low, allowing additional starts that later push the node above max_cpu_utilization.
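To make the locking concern concrete, here is a minimal standalone Go sketch of the pattern described in the first item; it is a simplified illustration of my reading of the code, not the actual monitor.go implementation, and the type, field, and method names are placeholders:

package main

import (
	"sync"
	"time"
)

// monitor is a simplified stand-in for the CPU bookkeeping in question;
// all names here are illustrative placeholders, not real monitor.go fields.
type monitor struct {
	mu         sync.Mutex
	pendingCPU float64
}

// acceptRequest reserves CPU for a new egress under the lock, then releases
// the reservation after the hold period.
func (m *monitor) acceptRequest(cost float64, hold time.Duration) {
	m.mu.Lock()
	m.pendingCPU += cost
	m.mu.Unlock()

	// Suspected pattern: the timer callback clears the reservation without
	// re-acquiring m.mu, so it can race with readers that do hold the lock.
	time.AfterFunc(hold, func() {
		m.pendingCPU -= cost // unsynchronized write
	})
}

// acceptRequestGuarded shows the same reset performed under the lock.
func (m *monitor) acceptRequestGuarded(cost float64, hold time.Duration) {
	m.mu.Lock()
	m.pendingCPU += cost
	m.mu.Unlock()

	time.AfterFunc(hold, func() {
		m.mu.Lock()
		m.pendingCPU -= cost
		m.mu.Unlock()
	})
}

func main() {
	m := &monitor{}
	m.acceptRequestGuarded(3.0, 15*time.Second)
}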
In the observed incident, two new room-composite egresses were accepted almost at the same time. They became active within a few seconds, and shortly after the node reported sustained high CPU and started killing egresses.
Egress Version
Running through OpenVidu V3.
OpenVidu component observed in logs:
@openvidu-meet/backend@3.6.0
The exact LiveKit Egress image/version still needs to be confirmed from the deployment.
Log references point to LiveKit Egress code paths such as:
github.com/livekit/egress/pkg/...
and gst-go dependency:
github.com/livekit/gst-go@v0.0.0-20250701011214-e7f61abd14cb
Egress Request
Two room-composite stream egresses were started almost at the same time on the same egress node.
Request 1
{
  "RoomComposite": {
    "room_name": "room-1",
    "layout": "PRESENTERS_ONLY",
    "custom_base_url": "https://<redacted-custom-layout-url>",
    "Output": {
      "Stream": {
        "protocol": 1,
        "urls": ["rtmp://<redacted>/live/<redacted>"]
      }
    },
    "Options": {
      "Preset": 2
    }
  }
}
Request 2
{
  "RoomComposite": {
    "room_name": "room-2",
    "layout": "PRESENTERS_ONLY",
    "custom_base_url": "https://<redacted-custom-layout-url>",
    "Output": {
      "Stream": {
        "protocol": 1,
        "urls": ["rtmp://<redacted>/live/<redacted>"]
      }
    },
    "Options": {
      "Preset": 2
    }
  }
}
Additional context
This was a single-node egress instance used by OpenVidu V3.
At the time of the incident, the node already had multiple active egress requests. The two new egresses were accepted at 2026-04-29T01:04:33Z and became active at 2026-04-29T01:04:36Z.
Immediately after that, the egress service reported:
cpu ~= 0.95–0.99
requests: 8
Then the egress service started killing egresses due to sustained high CPU.
Important timing:
2026-04-29T01:04:33Z - two new egress requests accepted
2026-04-29T01:04:36Z - both became EGRESS_ACTIVE
2026-04-29T01:04:38Z - high CPU warning started, cpu ~0.963, requests: 8
2026-04-29T01:04:47Z - killing egress due to sustained high CPU
2026-04-29T01:05:04Z - killing egress due to sustained high CPU
2026-04-29T01:05:16Z - killing egress due to sustained high CPU
The affected egresses eventually ended with Closed Remotely, but that was a delayed final state. The operational issue started around 2026-04-29T01:04:38Z, when CPU admission/capacity was exceeded.
The relevant event on the egress node was sustained high CPU followed by LiveKit Egress killing sessions.
Logs
Relevant sanitized logs:
2026-04-29T01:04:33.140Z INFO egress server/server_rpc.go:64 request received
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2"}
2026-04-29T01:04:33.142Z INFO egress server/server_rpc.go:74 request validated
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2","requestType":"room_composite","sourceType":"EGRESS_SOURCE_TYPE_WEB","outputType":"stream","room":"room-1"}
2026-04-29T01:04:33.458Z INFO egress server/server_rpc.go:64 request received
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg"}
2026-04-29T01:04:33.458Z INFO egress server/server_rpc.go:74 request validated
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg","requestType":"room_composite","sourceType":"EGRESS_SOURCE_TYPE_WEB","outputType":"stream","room":"room-2"}
2026-04-29T01:04:36.029Z INFO egress pipeline/watch.go:259 pipeline playing
{"nodeID":"NE_Ld6efiWfXu4w","handlerID":"EGH_NkbCG4pnGkWS","egressID":"EG_2X8PoxUbrAE2","timeToPlaying":"1.009169877s"}
2026-04-29T01:04:36.031Z INFO egress info/io.go:273 egress_active
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_2X8PoxUbrAE2","requestType":"room_composite","outputType":"stream"}
2026-04-29T01:04:36.218Z INFO egress pipeline/watch.go:259 pipeline playing
{"nodeID":"NE_Ld6efiWfXu4w","handlerID":"EGH_5SHEzXBcsn6u","egressID":"EG_NXLkCKcy6YWg","timeToPlaying":"1.01701603s"}
2026-04-29T01:04:36.222Z INFO egress info/io.go:273 egress_active
{"nodeID":"NE_Ld6efiWfXu4w","egressID":"EG_NXLkCKcy6YWg","requestType":"room_composite","outputType":"stream"}
2026-04-29T01:04:38.145Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9633333333333333,"requests":8}
2026-04-29T01:04:39.132Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.960669456080048,"requests":8}
2026-04-29T01:04:40.143Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9591666666666663,"requests":8}
2026-04-29T01:04:41.135Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9684123025558992,"requests":8}
2026-04-29T01:04:42.155Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9691923397255453,"requests":8}
2026-04-29T01:04:47.153Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9766860949296086,"requests":8}
2026-04-29T01:04:47.158Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9766860949296086}
2026-04-29T01:05:04.162Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9502487562145029,"requests":8}
2026-04-29T01:05:04.163Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9502487562145029}
2026-04-29T01:05:16.139Z WARN egress stats/monitor.go:629 high cpu usage
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9592346089835386,"requests":8}
2026-04-29T01:05:16.139Z WARN egress stats/monitor.go:641 killing egress due to sustained high cpu
{"nodeID":"NE_Ld6efiWfXu4w","cpu":0.9592346089835386}
Possible cause / analysis
After reviewing pkg/stats/monitor.go, I suspect the admission logic can underestimate real CPU during egress warmup.
The two new requests were accepted before their real CPU cost was visible in process stats. A few seconds later, when Chrome/GStreamer/RTMP encoding started, CPU jumped above the configured max_cpu_utilization.
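To make the admission arithmetic concrete, here is a hedged sketch of the kind of headroom check described above; the node size (8 logical cores), the assumed ~3-core reservation per room-composite, and the function names are illustrative assumptions, not values taken from monitor.go:

package main

import "fmt"

// canAccept sketches a headroom check: new work should only be admitted while
// measured CPU plus still-reserved (pending) CPU stays under total * maxUtilization.
func canAccept(totalCPUs, maxUtilization, measuredCPU, pendingCPU, estimatedCost float64) bool {
	available := totalCPUs*maxUtilization - measuredCPU - pendingCPU
	return estimatedCost <= available
}

func main() {
	const (
		totalCPUs      = 8.0 // hypothetical node size
		maxUtilization = 0.8 // from cpu_cost.max_cpu_utilization
		estimatedCost  = 3.0 // assumed per-room-composite reservation
	)

	// Warmup: measured CPU of already-running egresses is still low and their
	// pending reservations have been released, so the check sees headroom.
	fmt.Println(canAccept(totalCPUs, maxUtilization, 2.5, 0.0, estimatedCost)) // true: accepted

	// Steady state: the same workload measured at full cost no longer fits.
	fmt.Println(canAccept(totalCPUs, maxUtilization, 6.5, 0.0, estimatedCost)) // false

	// The first-request path described above would effectively compare against
	// totalCPUs (8.0) instead of totalCPUs*maxUtilization (6.4), a looser check.
}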
Potential contributing factors:
- pendingCPU is reset by the cpuHoldDuration timer, but if real CPU samples are still low or missing during warmup, the monitor may admit more work than the node can actually handle.
- The first request path appears to use full CPU capacity instead of total * MaxCpuUtilization.
- pendingCPU reset appears to happen from a timer goroutine without holding m.mu, while other paths read it under lock.
I can provide a PR for items (1) and (2), but would like maintainer guidance on how to handle (3), especially whether pendingCPU should remain as a floor until real CPU samples are available.
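As a starting point for that discussion, one possible shape of the "pendingCPU as a floor" behavior, written as a standalone sketch (the names and numbers are assumptions for illustration, not existing monitor.go code):

package main

import "fmt"

// effectiveCPU treats the reserved (pending) CPU as a floor: until real process
// samples catch up to the reservation, admission keeps charging the reserved
// amount instead of the lower measured value.
func effectiveCPU(measuredCPU, reservedCPU float64) float64 {
	if measuredCPU < reservedCPU {
		return reservedCPU
	}
	return measuredCPU
}

func main() {
	// Warmup: measured CPU (1.2 cores) is still below the 3.0-core reservation,
	// so the node keeps accounting for the full reservation and does not over-admit.
	fmt.Println(effectiveCPU(1.2, 3.0)) // 3

	// Steady state: measured CPU has caught up, so the real sample wins.
	fmt.Println(effectiveCPU(3.4, 3.0)) // 3.4
}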