feat(ai-proxy): add max_stream_duration_ms and max_response_bytes safeguards #13250

Merged
Baoyuantop merged 3 commits into apache:master from nic-6443:fix/ai-proxy-stream-runaway-limits-1776435956 on Apr 20, 2026
Conversation

@nic-6443
Member

What this does

Adds two opt-in configuration knobs to ai-proxy and ai-proxy-multi to protect the gateway from a runaway upstream LLM service:

  • max_stream_duration_ms — wall-clock cap on total streaming response duration.
  • max_response_bytes — cap on total bytes read from the upstream for a single response (streaming or non-streaming).

Both are opt-in (no default) — existing deployments are unaffected.
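For illustration, enabling both caps on a route might look like the sketch below. The values are arbitrary, and the other fields the plugin requires (provider, auth, model, etc.) are omitted; consult the plugin docs for the exact schema.

```json
{
  "plugins": {
    "ai-proxy": {
      "max_stream_duration_ms": 30000,
      "max_response_bytes": 1048576
    }
  }
}
```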

Why

The existing `timeout` field is fed to `httpc:set_timeout()`, which is a per-socket-operation timeout (connect / send / read-one-block). It does not bound the total duration of a streaming response. If an upstream LLM has a bug that causes it to continuously emit valid SSE tokens without ever sending a terminator (`[DONE]`, `message_stop`, `response.completed`), `parse_streaming_response` sits in an uncapped `while true` loop, pinning the worker at ~100% CPU indefinitely and degrading availability for all other traffic on that worker.
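The distinction can be sketched in a few lines (illustrative Python, not the plugin's Lua; the function and timing model are hypothetical): each read completes well within the per-read timeout, so the socket-level deadline never fires, and only a wall-clock check terminates the loop.

```python
def read_stream(chunks, per_read_timeout_ms, max_stream_duration_ms):
    """chunks: iterable of (arrival_delay_ms, data). Returns why the loop ended."""
    elapsed = 0
    for delay, _data in chunks:
        if delay > per_read_timeout_ms:
            return "per_read_timeout"       # the socket-level timeout would fire
        elapsed += delay
        if max_stream_duration_ms is not None and elapsed > max_stream_duration_ms:
            return "max_stream_duration"    # the wall-clock cap fires
    return "terminator"                     # upstream ended the stream normally

# A runaway upstream emitting a token every 100 ms forever: a 3000 ms
# per-read timeout never trips, but a 500 ms wall-clock cap does.
runaway = ((100, b"data: tok\n\n") for _ in iter(int, 1))  # infinite generator
print(read_stream(runaway, 3000, 500))
```

Without the wall-clock cap, the runaway case above would loop forever, which is exactly the failure mode described.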

Behavior on abort

  • Streaming, limit hit mid-stream (bytes already flushed): stop feeding chunks and force-close the upstream httpc (`close()` + `res._httpc = nil`, so we don't pool a half-drained connection). nginx closes the downstream connection at the end of the content phase. The client detects truncation via the missing protocol-specific terminator. We intentionally do not synthesize a per-protocol "graceful error" SSE frame: we support three client protocols (OpenAI chat, Anthropic messages, OpenAI responses) with different terminators, and a missing terminator is the standard SSE way any mid-stream network failure is communicated to clients.
  • Streaming, limit hit before any output: return 504 (duration) or 502 (size) so on_error / fallback / retry hooks can kick in like any other upstream failure.
  • Non-streaming, Content-Length exceeds cap: pre-check the header, force-close the connection, return 502 without ever reading the body.
  • Non-streaming, chunked / no Content-Length: post-read size check catches the oversized body and returns 502.
  • ctx.var.llm_request_done = true is set on abort so downstream filters (e.g. moderation plugins that defer work until completion) finalize their state.
  • A core.log.warn line is emitted on every abort (`aborting AI stream: <limit> exceeded; bytes=X duration_ms=Y route_id=Z`) so log-based alerting can surface the event. No new Prometheus metric: the log line is sufficient and avoids expanding the plugin's metric surface.
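The mid-stream vs. pre-output abort behavior above can be sketched as follows (illustrative Python, not the plugin's Lua; `pump_stream` and its parameters are hypothetical names). The key point is that once bytes have been flushed, the only option is silent truncation, whereas before any output an error status can still be returned for on_error / fallback handling.

```python
import time

def pump_stream(body_reader, send, max_bytes=None, max_ms=None):
    """Feed chunks downstream, enforcing best-effort caps after each read.
    Returns (status, err) if a limit trips before any output, else None."""
    start = time.monotonic()
    total, flushed = 0, False
    for chunk in body_reader:
        total += len(chunk)
        over_bytes = max_bytes is not None and total > max_bytes
        over_time = max_ms is not None and (time.monotonic() - start) * 1000 > max_ms
        if over_bytes or over_time:
            # (log the abort and force-close the upstream connection here)
            if flushed:
                return None  # truncate: client sees the missing terminator
            if over_bytes:
                return (502, "max_response_bytes exceeded")
            return (504, "max_stream_duration_ms exceeded")
        send(chunk)
        flushed = True
    return None  # upstream finished normally
```

For example, with a 25-byte cap and 10-byte chunks, two chunks are flushed before the third trips the cap, so the stream is truncated with no error status; if the very first chunk already exceeds the cap, a 502 is returned instead.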

Caveat (documented)

Both limits are best-effort: they are enforced after each chunk is read from the upstream, so the byte cap can overshoot by up to one upstream chunk (≈8 KiB in practice) and the duration cap can overshoot by up to one chunk's processing time. This is acceptable for the failure mode we are defending against (runaway streams produce tens of MB/s, so a one-chunk overshoot is negligible compared to "run forever").
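The overshoot bound can be seen with a toy model (illustrative Python; chunk size and cap values are arbitrary): because the check runs only after a chunk has been read, the total bytes read is the cap rounded up to the next chunk boundary, so the overshoot is always strictly less than one chunk.

```python
def bytes_read_before_abort(chunk_size, cap):
    """Model of the post-read check: read whole chunks until the cap is exceeded."""
    total = 0
    while total <= cap:       # the check happens after each read, not before
        total += chunk_size
    return total

# One 8 KiB chunk already overshoots a 2 KiB cap:
print(bytes_read_before_abort(8192, 2048))    # 8192
# With a 100 KB cap, the overshoot is still under one 8 KiB chunk:
print(bytes_read_before_abort(8192, 100_000))  # 106496
```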

Testing

New t/plugin/ai-proxy-stream-limits.t with a mock upstream that either streams OpenAI chat SSE chunks forever (no [DONE]) or returns a 100 KB body with matching Content-Length. Covers:

  1. max_stream_duration_ms=500 → request aborted in <5 s with the expected log line.
  2. max_response_bytes=2048 → request aborted in <5 s with the expected log line.
  3. Non-streaming max_response_bytes=1024 vs 100 KB upstream response → 502 + expected log line.
  4. Schema validation rejects max_stream_duration_ms: 0.

luacheck passes on all three modified Lua files.

Docs

Added rows to the config tables in docs/en/latest/plugins/ai-proxy.md, ai-proxy-multi.md, and their Chinese translations, with a note clarifying that `timeout` only bounds individual socket operations, while the new fields are needed to bound total stream duration and total bytes read.

Copilot AI review requested due to automatic review settings April 17, 2026 14:48
@dosubot added the size:XL (this PR changes 500-999 lines, ignoring generated files) and enhancement (new feature or request) labels on Apr 17, 2026
…eguards

Adds two opt-in configuration knobs to ai-proxy and ai-proxy-multi plugins
to protect the gateway from runaway upstream LLM services:

- max_stream_duration_ms: wall-clock cap on total streaming response
  duration. When exceeded, the upstream connection is force-closed.
- max_response_bytes: cap on total bytes read from upstream for a single
  response (streaming or non-streaming). For non-streaming responses,
  pre-checks Content-Length; for streaming, enforces after each chunk.

The existing `timeout` field only bounds per-socket-operation timeouts
(connect/send/read block), which does not protect against an upstream
that continuously emits valid SSE events forever. That failure mode
can pin a worker at 100% CPU indefinitely and degrade availability for
other traffic on the same worker.

Both fields are opt-in (no default); existing deployments are unaffected.

When a limit is hit mid-stream after bytes have been flushed to the
client, the gateway stops feeding chunks and closes the upstream
connection; the client observes a truncated SSE stream (missing the
protocol-specific terminator such as [DONE], message_stop, or
response.completed). When the limit is hit before any output has been
produced (e.g. the converter has skipped all upstream events so far),
504 is returned so on_error / fallback policies can kick in.

Adds an integration test with a mock upstream that either streams
forever or returns an oversized Content-Length, and a schema validation
case.
@nic-6443 force-pushed the fix/ai-proxy-stream-runaway-limits-1776435956 branch from 24a1433 to 913fa6a on April 17, 2026 14:51

Copilot AI left a comment

Pull request overview

Adds opt-in safeguards to the ai-proxy / ai-proxy-multi gateway to prevent runaway upstream LLM responses by enforcing maximum streaming duration and maximum upstream response bytes.

Changes:

  • Add max_stream_duration_ms and max_response_bytes to plugin schemas and documentation (EN/ZH).
  • Enforce stream duration/byte limits in streaming response parsing, and add Content-Length / body-size checks for non-streaming responses.
  • Add a new test suite covering streaming aborts and non-streaming oversized Content-Length.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:
  • apisix/plugins/ai-proxy/base.lua: Thread plugin config into provider parsing and propagate non-streaming parse status.
  • apisix/plugins/ai-providers/base.lua: Implement stream duration/byte abort logic and non-streaming size checks.
  • apisix/plugins/ai-proxy/schema.lua: Add new config knobs to ai-proxy and ai-proxy-multi schemas.
  • t/plugin/ai-proxy-stream-limits.t: Add regression tests for the new stream/size limits and schema validation.
  • docs/en/latest/plugins/ai-proxy.md: Document new knobs and clarify timeout semantics.
  • docs/en/latest/plugins/ai-proxy-multi.md: Document new knobs and clarify timeout semantics.
  • docs/zh/latest/plugins/ai-proxy.md: Chinese doc updates for new knobs and timeout clarification.
  • docs/zh/latest/plugins/ai-proxy-multi.md: Chinese doc updates for new knobs and timeout clarification.


Review comment threads:
  • apisix/plugins/ai-providers/base.lua (outdated)
  • apisix/plugins/ai-proxy/base.lua (outdated)
  • t/plugin/ai-proxy-stream-limits.t (outdated)
  • t/plugin/ai-proxy-stream-limits.t
…on-streaming

Previously parse_response called res:read_body() first and then checked
the size after, which meant a runaway chunked upstream could force the
worker to buffer arbitrarily many bytes before the cap tripped. Switch
to reading via res.body_reader() when max_response_bytes is set, so the
cap is enforced as bytes arrive, matching the streaming path's behavior.
…nused require

Address Copilot review:
- parse_streaming_response returns (status, error_message) when the new
  stream limits trip before any bytes are flushed; capture both at the
  call site so on_error / fallback handlers can see the reason instead
  of just the status code.
- Remove an unused require('apisix.core') from TEST 2.
@Baoyuantop merged commit ecbb6fe into apache:master on Apr 20, 2026
34 checks passed
@nic-6443 deleted the fix/ai-proxy-stream-runaway-limits-1776435956 branch on April 20, 2026 02:43