feat(prometheus): support disabling labels and metrics to reduce cardinality #13202
janiussyafiq wants to merge 8 commits into apache:master
Conversation
Pull request overview
Adds per-metric label value collapsing to the Prometheus plugin to help reduce metric cardinality (without removing labels/metrics), plus docs and tests to cover the new behavior.
Changes:
- Implement `plugin_attr.prometheus.metrics.<metric>.disable_labels` to force selected built-in label values to `""` at emission time.
- Add a new test covering `disable_labels` behavior for `http_status`.
- Document `disable_labels` in English/Chinese docs and the example config.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `apisix/plugins/prometheus/exporter.lua` | Applies per-metric disabled-label value collapsing when emitting Prometheus metrics. |
| `t/plugin/prometheus-label-filter.t` | Adds regression test ensuring disabled labels are emitted as empty-string values. |
| `docs/en/latest/plugins/prometheus.md` | Documents `disable_labels` usage in `plugin_attr` examples. |
| `docs/zh/latest/plugins/prometheus.md` | Documents `disable_labels` usage in `plugin_attr` examples (ZH). |
| `conf/config.yaml.example` | Adds example configuration snippet showing `disable_labels`. |
```lua
for metric_name, metric_conf in pairs(metrics_conf) do
    if metric_conf.disable_labels then
        disabled_label_metric_map[metric_name] = {}
        for _, label in ipairs(metric_conf.disable_labels) do
            disabled_label_metric_map[metric_name][label] = true
        end
    end
end
```
`disable_labels` is iterated with `ipairs(metric_conf.disable_labels)` without any type checks. Because `plugin_attr` does not appear to be schema-validated, a misconfiguration (e.g. `disable_labels: node` as a string) will raise a runtime error on every request. Consider validating the `metrics_conf`/`metric_conf`/`disable_labels` types before iterating (and logging + ignoring invalid config) to keep the datapath robust.
Suggested change (replacing the loop above):

```lua
if type(metrics_conf) ~= "table" then
    core.log.warn("invalid prometheus plugin_attr.metrics type: ",
                  type(metrics_conf), ", expected table; ignoring")
    return disabled_label_metric_map
end
for metric_name, metric_conf in pairs(metrics_conf) do
    if type(metric_conf) ~= "table" then
        core.log.warn("invalid prometheus metric config for metric '", metric_name,
                      "': expected table, got ", type(metric_conf), "; ignoring")
    elseif metric_conf.disable_labels ~= nil then
        if type(metric_conf.disable_labels) ~= "table" then
            core.log.warn("invalid prometheus disable_labels for metric '", metric_name,
                          "': expected table, got ", type(metric_conf.disable_labels),
                          "; ignoring")
        else
            disabled_label_metric_map[metric_name] = {}
            for _, label in ipairs(metric_conf.disable_labels) do
                disabled_label_metric_map[metric_name][label] = true
            end
        end
    end
end
```
```lua
local metric_label_map = {
    http_status = {"code", "route", "matched_uri", "matched_host", "service", "consumer", "node",
                   "request_type", "request_llm_model", "llm_model", "response_source"},
    http_latency = {"type", "route", "service", "consumer", "node",
                    "request_type", "request_llm_model", "llm_model"},
    bandwidth = {"type", "route", "service", "consumer", "node",
                 "request_type", "request_llm_model", "llm_model"},
    llm_latency = {"route_id", "service_id", "consumer", "node",
                   "request_type", "request_llm_model", "llm_model"},
    llm_prompt_tokens = {"route_id", "service_id", "consumer", "node",
                         "request_type", "request_llm_model", "llm_model"},
    llm_completion_tokens = {"route_id", "service_id", "consumer", "node",
                             "request_type", "request_llm_model", "llm_model"},
    llm_active_connections = {"route", "route_id", "matched_uri", "matched_host",
                              "service", "service_id", "consumer", "node",
                              "request_type", "request_llm_model", "llm_model"},
```
`metric_label_map` duplicates the label-name lists already defined in `http_init` for each metric. This introduces a drift risk: if labels are added or reordered in the metric registration, `disable_labels` will silently stop applying to the intended label or apply to the wrong one. Consider defining the label-name arrays once (shared by both registration and filtering), or deriving the name list from the metric definition so ordering can't diverge.
Suggested change (replacing the table above):

```lua
local HTTP_STATUS_LABELS = {"code", "route", "matched_uri", "matched_host", "service", "consumer",
                            "node", "request_type", "request_llm_model", "llm_model", "response_source"}
local HTTP_LATENCY_LABELS = {"type", "route", "service", "consumer", "node",
                             "request_type", "request_llm_model", "llm_model"}
local BANDWIDTH_LABELS = {"type", "route", "service", "consumer", "node",
                          "request_type", "request_llm_model", "llm_model"}
local LLM_LATENCY_LABELS = {"route_id", "service_id", "consumer", "node",
                            "request_type", "request_llm_model", "llm_model"}
local LLM_PROMPT_TOKENS_LABELS = {"route_id", "service_id", "consumer", "node",
                                  "request_type", "request_llm_model", "llm_model"}
local LLM_COMPLETION_TOKENS_LABELS = {"route_id", "service_id", "consumer", "node",
                                      "request_type", "request_llm_model", "llm_model"}
local LLM_ACTIVE_CONNECTIONS_LABELS = {"route", "route_id", "matched_uri", "matched_host",
                                       "service", "service_id", "consumer", "node",
                                       "request_type", "request_llm_model", "llm_model"}

local metric_label_map = {
    http_status = HTTP_STATUS_LABELS,
    http_latency = HTTP_LATENCY_LABELS,
    bandwidth = BANDWIDTH_LABELS,
    llm_latency = LLM_LATENCY_LABELS,
    llm_prompt_tokens = LLM_PROMPT_TOKENS_LABELS,
    llm_completion_tokens = LLM_COMPLETION_TOKENS_LABELS,
    llm_active_connections = LLM_ACTIVE_CONNECTIONS_LABELS,
```
```lua
function _M.http_log(conf, ctx)
    local vars = ctx.var
    local disabled_label_metric_map = get_disabled_label_metric_map()
```
`get_disabled_label_metric_map()` builds a fresh nested table on every request (`http_log` runs in the hot path). This adds avoidable allocations/GC pressure at high QPS (the exact scenario this feature targets). Consider caching the computed disabled-label map per worker and only recomputing when `plugin_attr.prometheus.metrics` changes (e.g. by tracking a config version/`modifiedIndex` or comparing the attr table reference).
Suggested change (replacing the snippet above):

```lua
local cached_prometheus_metrics_attr
local cached_disabled_label_metric_map = {}

local function get_cached_disabled_label_metric_map()
    local attr = plugin.plugin_attr("prometheus")
    local metrics_attr = attr and attr.metrics
    if metrics_attr ~= cached_prometheus_metrics_attr then
        cached_prometheus_metrics_attr = metrics_attr
        cached_disabled_label_metric_map = get_disabled_label_metric_map() or {}
    end
    return cached_disabled_label_metric_map
end

function _M.http_log(conf, ctx)
    local vars = ctx.var
    local disabled_label_metric_map = get_cached_disabled_label_metric_map()
```
```lua
local disabled_label_metric_map = get_disabled_label_metric_map()
```
Same hot-path allocation issue as in `http_log`: `inc_llm_active_connections` recomputes `disabled_label_metric_map` on each increment/decrement. If you implement caching for `get_disabled_label_metric_map`, it should be reused here as well to avoid extra per-request allocations.
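If the caching suggested for `http_log` is adopted, the reuse could look roughly like the following sketch (the function signature and body here are illustrative, with unrelated logic elided; this is not the actual exporter code):

```lua
-- Hypothetical sketch: reuse the worker-cached map instead of
-- rebuilding it on every connection-count change.
function _M.inc_llm_active_connections(ctx)
    local disabled_label_metric_map = get_cached_disabled_label_metric_map()
    -- ... existing label assembly and llm_active_connections gauge update ...
end
```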
```yaml
# disable_labels:    # List of built-in label names to drop (reduces cardinality).
#   - node           # e.g. drop the upstream node IP label
```
The new `disable_labels` comment says "drop" labels, but the implementation/docs describe collapsing the label values to an empty string while keeping the label present (non-breaking). The wording here is misleading and could cause users to assume the label will be removed from the metric schema.
Suggested change:

```yaml
# disable_labels:    # List of built-in label names whose values are set to empty strings
#                    # (reduces cardinality while keeping labels).
#   - node           # e.g. keep the upstream node IP label but set its value to an empty string
```
Description
Add a new per-metric configuration option to the Prometheus plugin's `plugin_attr`:

- `disable_labels`: a list of built-in label names whose values will be collapsed to an empty string (`""`) for a given metric, reducing cardinality without changing the metric schema.

This is a non-breaking change — all labels remain registered and present in the output, so existing Prometheus dashboards and recording rules are unaffected. Only the values of the specified labels are zeroed out.
An alternative approach of `disable: true` (removing a metric entirely from `/metrics`) was considered but excluded from this PR as it is a breaking change — it would cause dashboards to show "No data", misfire `absent()` alerts, and break recording rules. Instead, the same effect of suppressing all label cardinality on a metric can be achieved by listing all of its labels under `disable_labels`, which keeps the metric present in the output with empty-string values while remaining fully non-breaking.

The option is configured under `plugin_attr.prometheus.metrics.<metric_name>` alongside the existing `expire` and `extra_labels` fields.

Example config:
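A minimal sketch of what such a config could look like (the choice of the `http_status` metric and the `node`/`consumer` labels is illustrative, matching the behavior described in this PR):

```yaml
plugin_attr:
  prometheus:
    metrics:
      http_status:
        disable_labels:   # values for these labels are emitted as ""
          - node
          - consumer
```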
With the above config, `node` and `consumer` will appear as `node=""` and `consumer=""` in the metric output instead of carrying real values, effectively collapsing all time series that differ only by those labels.

This addresses high-cardinality issues in dynamic environments (e.g. Kubernetes autoscaling where pod IPs churn rapidly), which can cause Prometheus shared dict overflow and excessive memory consumption.
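Conceptually, the emission-time collapsing amounts to something like the following simplified sketch (`label_names`, `label_values`, `disabled_map`, and `collapse_disabled` are illustrative names, not the exporter's real internals):

```lua
-- Simplified sketch of emission-time value collapsing.
local function collapse_disabled(metric_name, label_names, label_values, disabled_map)
    local disabled = disabled_map[metric_name]
    if not disabled then
        return label_values
    end
    local out = {}
    for i, name in ipairs(label_names) do
        -- the label stays present in the output; only its value is emptied
        out[i] = disabled[name] and "" or label_values[i]
    end
    return out
end
```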
Which issue(s) this PR fixes:
Fixes #12679
Checklist