feat(ai-proxy): add max_stream_duration_ms and max_response_bytes safeguards #13250
Merged
Baoyuantop merged 3 commits into apache:master on Apr 20, 2026
Conversation
…eguards

Adds two opt-in configuration knobs to the ai-proxy and ai-proxy-multi plugins to protect the gateway from runaway upstream LLM services:

- `max_stream_duration_ms`: wall-clock cap on total streaming response duration. When exceeded, the upstream connection is force-closed.
- `max_response_bytes`: cap on total bytes read from upstream for a single response (streaming or non-streaming). For non-streaming responses, `Content-Length` is pre-checked; for streaming, the limit is enforced after each chunk.

The existing `timeout` field only bounds per-socket-operation timeouts (connect/send/read block), so it does not protect against an upstream that continuously emits valid SSE events forever. That failure mode can pin a worker at 100% CPU indefinitely and degrade availability for other traffic on the same worker.

Both fields are opt-in (no default); existing deployments are unaffected. When a limit is hit mid-stream after bytes have been flushed to the client, the gateway stops feeding chunks and closes the upstream connection; the client observes a truncated SSE stream (missing the protocol-specific terminator such as `[DONE]`, `message_stop`, or `response.completed`). When the limit is hit before any output has been produced (e.g. the converter has skipped all upstream events so far), a 504 is returned so `on_error` / fallback policies can kick in.

Adds an integration test with a mock upstream that either streams forever or returns an oversized `Content-Length`, plus a schema validation case.
Pull request overview
Adds opt-in safeguards to the ai-proxy / ai-proxy-multi plugins to protect the gateway from runaway upstream LLM responses by enforcing a maximum streaming duration and a cap on upstream response bytes.
Changes:
- Add `max_stream_duration_ms` and `max_response_bytes` to plugin schemas and documentation (EN/ZH).
- Enforce stream duration/byte limits in streaming response parsing, and add `Content-Length` / body-size checks for non-streaming responses.
- Add a new test suite covering streaming aborts and non-streaming oversized `Content-Length`.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `apisix/plugins/ai-proxy/base.lua` | Thread plugin config into provider parsing and propagate non-streaming parse status. |
| `apisix/plugins/ai-providers/base.lua` | Implement stream duration/byte abort logic and non-streaming size checks. |
| `apisix/plugins/ai-proxy/schema.lua` | Add new config knobs to ai-proxy and ai-proxy-multi schemas. |
| `t/plugin/ai-proxy-stream-limits.t` | Add regression tests for the new stream/size limits and schema validation. |
| `docs/en/latest/plugins/ai-proxy.md` | Document new knobs and clarify timeout semantics. |
| `docs/en/latest/plugins/ai-proxy-multi.md` | Document new knobs and clarify timeout semantics. |
| `docs/zh/latest/plugins/ai-proxy.md` | Chinese doc updates for new knobs and timeout clarification. |
| `docs/zh/latest/plugins/ai-proxy-multi.md` | Chinese doc updates for new knobs and timeout clarification. |
…on-streaming

Previously, `parse_response` called `res:read_body()` first and checked the size afterwards, which meant a runaway chunked upstream could force the worker to buffer arbitrarily many bytes before the cap tripped. Switch to reading via `res.body_reader` when `max_response_bytes` is set, so the cap is enforced as bytes arrive, matching the streaming path's behavior.
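A minimal sketch of the capped read described above, assuming a lua-resty-http style `res.body_reader` iterator; the function name `read_body_capped` and the 8 KiB chunk size are illustrative, not the plugin's actual code:

```lua
-- Read the upstream body chunk by chunk so max_response_bytes is enforced
-- as bytes arrive, rather than after a full res:read_body().
local function read_body_capped(res, max_response_bytes)
    local reader = res.body_reader
    local chunks, total = {}, 0
    while true do
        local chunk, err = reader(8192)  -- read at most 8 KiB per iteration
        if err then
            return nil, err
        end
        if not chunk then
            break  -- upstream finished the body
        end
        total = total + #chunk
        if total > max_response_bytes then
            -- Cap tripped mid-read: the caller force-closes the connection.
            return nil, "max_response_bytes exceeded"
        end
        chunks[#chunks + 1] = chunk
    end
    return table.concat(chunks)
end
```

The key property is that memory use is bounded by the cap plus at most one chunk, whereas the old read-then-check order buffered the entire body first.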
…nused require
Address Copilot review:
- parse_streaming_response returns (status, error_message) when the new
stream limits trip before any bytes are flushed; capture both at the
call site so on_error / fallback handlers can see the reason instead
of just the status code.
- Remove an unused require('apisix.core') from TEST 2.
moonming approved these changes Apr 18, 2026

membphis approved these changes Apr 20, 2026

Baoyuantop approved these changes Apr 20, 2026
What this does
Adds two opt-in configuration knobs to `ai-proxy` and `ai-proxy-multi` to protect the gateway from a runaway upstream LLM service:

- `max_stream_duration_ms`: wall-clock cap on total streaming response duration.
- `max_response_bytes`: cap on total bytes read from the upstream for a single response (streaming or non-streaming).

Both are opt-in (no default); existing deployments are unaffected.
Why
The existing
timeoutfield is fed tohttpc:set_timeout(), which is a per-socket-operation timeout (connect / send / read-one-block). It does not bound the total duration of a streaming response. If an upstream LLM has a bug that causes it to continuously emit valid SSE tokens without ever sending a terminator ([DONE],message_stop,response.completed),parse_streaming_responsesits in an uncappedwhile trueloop, pinning the worker at ~100% CPU indefinitely and degrading availability for all other traffic on that worker.Behavior on abort
- Mid-stream abort (bytes already flushed): the upstream connection is force-closed (`close()` plus `res._httpc = nil`, so we don't pool a half-drained connection). nginx closes the downstream connection at the end of the content phase. The client detects truncation via the missing protocol-specific terminator. We intentionally do not synthesize a per-protocol "graceful error" SSE frame: we support three client protocols (OpenAI chat, Anthropic messages, OpenAI responses) with different terminators, and a missing terminator is the standard SSE way any mid-stream network failure is communicated to clients.
- Abort before any output has been produced: return `504` (duration) or `502` (size) so `on_error` / fallback / retry hooks can kick in like any other upstream failure.
- Non-streaming, `Content-Length` exceeds the cap: pre-check the header, force-close the connection, and return `502` without ever reading the body.
- Non-streaming, no `Content-Length`: a post-read size check catches the oversized body and returns `502`.
- `ctx.var.llm_request_done = true` is set on abort so downstream filters (e.g. moderation plugins that defer work until completion) finalize their state.
- A `core.log.warn` line is emitted on every abort (`aborting AI stream: <limit> exceeded; bytes=X duration_ms=Y route_id=Z`) so log-based alerting can surface the event. No new Prometheus metric: the log line is sufficient and avoids expanding the plugin's metric surface.

Caveat (documented)
Both limits are best-effort: they are enforced after each chunk is read from the upstream, so the byte cap can overshoot by up to one upstream chunk (≈8 KiB in practice) and the duration cap can overshoot by up to one chunk's processing time. This is acceptable for the failure mode we are defending against (runaway streams produce tens of MB/s, so a one-chunk overshoot is negligible compared to "run forever").
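The per-chunk enforcement can be sketched as a pure predicate evaluated once per chunk; `check_limits`, the `state` table, and the caller-supplied `now_ms` clock are hypothetical names for illustration, not the plugin's actual internals:

```lua
-- Called once per chunk, after the chunk's bytes have been counted into
-- state.bytes, so either limit can overshoot by at most one chunk.
-- Returns the name of the tripped limit, or nil to keep streaming.
local function check_limits(state, now_ms, max_bytes, max_duration_ms)
    if max_bytes and state.bytes > max_bytes then
        return "max_response_bytes"
    end
    if max_duration_ms and (now_ms - state.start_ms) > max_duration_ms then
        return "max_stream_duration_ms"
    end
    return nil  -- both limits OK (or unset), continue streaming
end
```

Because the check runs only between chunk reads, a chunk that arrives just under the deadline is still fully processed, which is exactly the one-chunk overshoot the caveat describes.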
Testing
New `t/plugin/ai-proxy-stream-limits.t` with a mock upstream that either streams OpenAI chat SSE chunks forever (no `[DONE]`) or returns a 100 KB body with a matching `Content-Length`. Covers:

- `max_stream_duration_ms=500`: request aborted in <5 s with the expected log line.
- `max_response_bytes=2048`: request aborted in <5 s with the expected log line.
- `max_response_bytes=1024` vs a 100 KB upstream response: `502` plus the expected log line.
- A schema validation case for `max_stream_duration_ms: 0`.
- `luacheck` passes on all three modified Lua files.

Docs
Added rows to the config tables in `docs/en/latest/plugins/ai-proxy.md`, `ai-proxy-multi.md`, and their Chinese translations, with a clarifying note that `timeout` only bounds per-socket-operation timeouts and that the new fields are needed to bound total stream duration / total bytes read.