feat(prometheus): support disabling labels and metrics to reduce cardinality#13202

Open
janiussyafiq wants to merge 8 commits into apache:master from janiussyafiq:feat/prom-disable-metrics

@janiussyafiq commented Apr 13, 2026

Description

Add a new per-metric configuration option to the Prometheus plugin's plugin_attr:

  • disable_labels: A list of built-in label names whose values will be collapsed to an empty string "" for a given metric, reducing cardinality without changing the metric schema.

This is a non-breaking change: all labels remain registered and present in the output, so existing Prometheus dashboards and recording rules are unaffected. Only the values of the specified labels are replaced with an empty string.

An alternative, disable: true (removing a metric entirely from /metrics), was considered but left out of this PR because it is a breaking change: dashboards would show "No data", absent() alerts would misfire, and recording rules would break. Label cardinality on a metric can instead be suppressed entirely by listing all of its labels under disable_labels, which keeps the metric in the output with empty-string label values and remains non-breaking.

The option is configured under plugin_attr.prometheus.metrics.<metric_name>, alongside the existing expire and extra_labels fields.

Example config:

plugin_attr:
  prometheus:
    metrics:
      http_status:
        disable_labels:
          - node
          - consumer
      http_latency:
        disable_labels:
          - node

With the above config, node and consumer will appear as node="",consumer="" in the metric output instead of carrying real values, effectively collapsing all time series that differ only by those labels.
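
The collapsing described above can be sketched in a few lines of Lua. This is a minimal illustration, not the code in exporter.lua: the helper name collapse_disabled_labels, the four-label subset, and the sample values are all assumptions made for the example.

```lua
-- Hypothetical helper (not from the PR): returns a copy of label_values in which
-- every value whose label name is in the `disabled` set is replaced by "".
local function collapse_disabled_labels(label_names, label_values, disabled)
    local out = {}
    for i, name in ipairs(label_names) do
        if disabled and disabled[name] then
            out[i] = ""
        else
            out[i] = label_values[i]
        end
    end
    return out
end

-- Illustrative subset of the http_status labels; the real metric has more.
local label_names = {"code", "route", "consumer", "node"}
-- Set form of `disable_labels: [node, consumer]`.
local disabled = {node = true, consumer = true}

-- Two requests that differ only in consumer and upstream node (made-up values).
local a = collapse_disabled_labels(label_names, {"200", "r1", "alice", "10.0.0.1:80"}, disabled)
local b = collapse_disabled_labels(label_names, {"200", "r1", "bob", "10.0.0.2:80"}, disabled)

print(table.concat(a, "|"))  -- 200|r1||
print(table.concat(b, "|"))  -- 200|r1||
```

Both requests end up with the same label tuple, so they increment a single time series instead of two.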

This addresses high-cardinality issues in dynamic environments (e.g. Kubernetes autoscaling, where pod IPs churn rapidly), which can overflow the Prometheus shared dict and consume excessive memory.

Which issue(s) this PR fixes:

Fixes #12679

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@janiussyafiq marked this pull request as ready for review on April 17, 2026 06:35
@dosubot (bot) added the size:L and enhancement labels on Apr 17, 2026

Copilot AI left a comment

Pull request overview

Adds per-metric label value collapsing to the Prometheus plugin to help reduce metric cardinality (without removing labels/metrics), plus docs and tests to cover the new behavior.

Changes:

  • Implement plugin_attr.prometheus.metrics.<metric>.disable_labels to force selected built-in label values to "" at emission time.
  • Add a new test covering disable_labels behavior for http_status.
  • Document disable_labels in English/Chinese docs and the example config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| apisix/plugins/prometheus/exporter.lua | Applies per-metric disabled-label value collapsing when emitting Prometheus metrics. |
| t/plugin/prometheus-label-filter.t | Adds regression test ensuring disabled labels are emitted as empty-string values. |
| docs/en/latest/plugins/prometheus.md | Documents disable_labels usage in plugin_attr examples. |
| docs/zh/latest/plugins/prometheus.md | Documents disable_labels usage in plugin_attr examples (ZH). |
| conf/config.yaml.example | Adds example configuration snippet showing disable_labels. |

Comment on lines +132 to +139
    for metric_name, metric_conf in pairs(metrics_conf) do
        if metric_conf.disable_labels then
            disabled_label_metric_map[metric_name] = {}
            for _, label in ipairs(metric_conf.disable_labels) do
                disabled_label_metric_map[metric_name][label] = true
            end
        end
    end

Copilot AI Apr 21, 2026

disable_labels is iterated with ipairs(metric_conf.disable_labels) without any type checks. Because plugin_attr does not appear to be schema-validated, a misconfiguration (e.g. disable_labels: node as a string) will raise a runtime error on every request. Consider validating metrics_conf/metric_conf/disable_labels types before iterating (and logging + ignoring invalid config) to keep the datapath robust.

Suggested change
-    for metric_name, metric_conf in pairs(metrics_conf) do
-        if metric_conf.disable_labels then
-            disabled_label_metric_map[metric_name] = {}
-            for _, label in ipairs(metric_conf.disable_labels) do
-                disabled_label_metric_map[metric_name][label] = true
-            end
-        end
-    end
+    if type(metrics_conf) ~= "table" then
+        core.log.warn("invalid prometheus plugin_attr.metrics type: ",
+                      type(metrics_conf), ", expected table; ignoring")
+        return disabled_label_metric_map
+    end
+    for metric_name, metric_conf in pairs(metrics_conf) do
+        if type(metric_conf) ~= "table" then
+            core.log.warn("invalid prometheus metric config for metric '", metric_name,
+                          "': expected table, got ", type(metric_conf), "; ignoring")
+        elseif metric_conf.disable_labels ~= nil then
+            if type(metric_conf.disable_labels) ~= "table" then
+                core.log.warn("invalid prometheus disable_labels for metric '", metric_name,
+                              "': expected table, got ", type(metric_conf.disable_labels), "; ignoring")
+            else
+                disabled_label_metric_map[metric_name] = {}
+                for _, label in ipairs(metric_conf.disable_labels) do
+                    disabled_label_metric_map[metric_name][label] = true
+                end
+            end
+        end
+    end

Comment on lines +109 to +124
local metric_label_map = {
    http_status = {"code", "route", "matched_uri", "matched_host", "service", "consumer", "node",
                   "request_type", "request_llm_model", "llm_model", "response_source"},
    http_latency = {"type", "route", "service", "consumer", "node",
                    "request_type", "request_llm_model", "llm_model"},
    bandwidth = {"type", "route", "service", "consumer", "node",
                 "request_type", "request_llm_model", "llm_model"},
    llm_latency = {"route_id", "service_id", "consumer", "node",
                   "request_type", "request_llm_model", "llm_model"},
    llm_prompt_tokens = {"route_id", "service_id", "consumer", "node",
                         "request_type", "request_llm_model", "llm_model"},
    llm_completion_tokens = {"route_id", "service_id", "consumer", "node",
                             "request_type", "request_llm_model", "llm_model"},
    llm_active_connections = {"route", "route_id", "matched_uri", "matched_host",
                              "service", "service_id", "consumer", "node",
                              "request_type", "request_llm_model", "llm_model"},

Copilot AI Apr 21, 2026

metric_label_map duplicates the label-name lists already defined in http_init for each metric. This introduces a drift risk: if labels are added/reordered in the metric registration, disable_labels will silently stop applying to the intended label or apply to the wrong one. Consider defining the label-name arrays once (shared by both registration and filtering), or deriving the name list from the metric definition so ordering can’t diverge.

Suggested change
-local metric_label_map = {
-    http_status = {"code", "route", "matched_uri", "matched_host", "service", "consumer", "node",
-                   "request_type", "request_llm_model", "llm_model", "response_source"},
-    http_latency = {"type", "route", "service", "consumer", "node",
-                    "request_type", "request_llm_model", "llm_model"},
-    bandwidth = {"type", "route", "service", "consumer", "node",
-                 "request_type", "request_llm_model", "llm_model"},
-    llm_latency = {"route_id", "service_id", "consumer", "node",
-                   "request_type", "request_llm_model", "llm_model"},
-    llm_prompt_tokens = {"route_id", "service_id", "consumer", "node",
-                         "request_type", "request_llm_model", "llm_model"},
-    llm_completion_tokens = {"route_id", "service_id", "consumer", "node",
-                             "request_type", "request_llm_model", "llm_model"},
-    llm_active_connections = {"route", "route_id", "matched_uri", "matched_host",
-                              "service", "service_id", "consumer", "node",
-                              "request_type", "request_llm_model", "llm_model"},
+local HTTP_STATUS_LABELS = {"code", "route", "matched_uri", "matched_host", "service", "consumer",
+                            "node", "request_type", "request_llm_model", "llm_model", "response_source"}
+local HTTP_LATENCY_LABELS = {"type", "route", "service", "consumer", "node",
+                             "request_type", "request_llm_model", "llm_model"}
+local BANDWIDTH_LABELS = {"type", "route", "service", "consumer", "node",
+                          "request_type", "request_llm_model", "llm_model"}
+local LLM_LATENCY_LABELS = {"route_id", "service_id", "consumer", "node",
+                            "request_type", "request_llm_model", "llm_model"}
+local LLM_PROMPT_TOKENS_LABELS = {"route_id", "service_id", "consumer", "node",
+                                  "request_type", "request_llm_model", "llm_model"}
+local LLM_COMPLETION_TOKENS_LABELS = {"route_id", "service_id", "consumer", "node",
+                                      "request_type", "request_llm_model", "llm_model"}
+local LLM_ACTIVE_CONNECTIONS_LABELS = {"route", "route_id", "matched_uri", "matched_host",
+                                       "service", "service_id", "consumer", "node",
+                                       "request_type", "request_llm_model", "llm_model"}
+local metric_label_map = {
+    http_status = HTTP_STATUS_LABELS,
+    http_latency = HTTP_LATENCY_LABELS,
+    bandwidth = BANDWIDTH_LABELS,
+    llm_latency = LLM_LATENCY_LABELS,
+    llm_prompt_tokens = LLM_PROMPT_TOKENS_LABELS,
+    llm_completion_tokens = LLM_COMPLETION_TOKENS_LABELS,
+    llm_active_connections = LLM_ACTIVE_CONNECTIONS_LABELS,
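
The "define once, share" idea can be sketched as follows. It assumes a single table of label names that both metric registration and label filtering read from. The names metric_labels, register_metric, and disabled_positions are hypothetical, the label list is an illustrative subset, and register_metric is only a stand-in for the real registration call in http_init.

```lua
-- Hypothetical registry: each metric's label names are declared exactly once.
local metric_labels = {
    http_latency = {"type", "route", "service", "consumer", "node"},  -- illustrative subset
}

-- Stand-in for real metric registration; it only records the label list it was given.
local registered = {}
local function register_metric(name)
    registered[name] = metric_labels[name]
end

-- Derive the positions to blank from the same shared list, so registration
-- order and filtering order cannot diverge.
local function disabled_positions(name, disabled_labels)
    local wanted = {}
    for _, label in ipairs(disabled_labels) do
        wanted[label] = true
    end
    local positions = {}
    for i, label in ipairs(metric_labels[name]) do
        if wanted[label] then
            positions[#positions + 1] = i
        end
    end
    return positions
end

register_metric("http_latency")
local pos = disabled_positions("http_latency", {"node", "consumer"})
print(table.concat(pos, ","))  -- 4,5
```

Because the positions are computed from the list that registration uses, adding or reordering a label in metric_labels updates both sides at once.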

Comment on lines 348 to +350
function _M.http_log(conf, ctx)
    local vars = ctx.var
    local disabled_label_metric_map = get_disabled_label_metric_map()

Copilot AI Apr 21, 2026

get_disabled_label_metric_map() builds a fresh nested table on every request (http_log runs in the hot path). This adds avoidable allocations/GC pressure at high QPS (the exact scenario this feature targets). Consider caching the computed disabled-label map per worker and only recomputing when plugin_attr.prometheus.metrics changes (e.g. by tracking a config version/modifiedIndex or comparing the attr table reference).

Suggested change
-function _M.http_log(conf, ctx)
-    local vars = ctx.var
-    local disabled_label_metric_map = get_disabled_label_metric_map()
+local cached_prometheus_metrics_attr
+local cached_disabled_label_metric_map = {}
+local function get_cached_disabled_label_metric_map()
+    local attr = plugin.plugin_attr("prometheus")
+    local metrics_attr = attr and attr.metrics
+    if metrics_attr ~= cached_prometheus_metrics_attr then
+        cached_prometheus_metrics_attr = metrics_attr
+        cached_disabled_label_metric_map = get_disabled_label_metric_map() or {}
+    end
+    return cached_disabled_label_metric_map
+end
+function _M.http_log(conf, ctx)
+    local vars = ctx.var
+    local disabled_label_metric_map = get_cached_disabled_label_metric_map()
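
The reference-comparison cache can be exercised outside APISIX with a small standalone sketch. Here build_disabled_map is a hypothetical stand-in for get_disabled_label_metric_map() that takes the metrics config as an argument, and build_count exists only to make rebuilds observable. The sketch assumes the config table is replaced rather than mutated in place when configuration changes, which would need to be confirmed against how plugin_attr is reloaded.

```lua
local build_count = 0

-- Stand-in for the PR's map builder; takes the config explicitly so it runs standalone.
local function build_disabled_map(metrics_conf)
    build_count = build_count + 1
    local map = {}
    for metric_name, metric_conf in pairs(metrics_conf or {}) do
        -- Skip entries of the wrong type instead of raising inside ipairs.
        if type(metric_conf) == "table" and type(metric_conf.disable_labels) == "table" then
            map[metric_name] = {}
            for _, label in ipairs(metric_conf.disable_labels) do
                map[metric_name][label] = true
            end
        end
    end
    return map
end

local cached_conf, cached_map

-- Rebuild only when the config table reference changes.
local function get_cached_map(metrics_conf)
    if cached_map == nil or metrics_conf ~= cached_conf then
        cached_conf = metrics_conf
        cached_map = build_disabled_map(metrics_conf)
    end
    return cached_map
end

local conf_v1 = {http_status = {disable_labels = {"node"}}}
get_cached_map(conf_v1)
get_cached_map(conf_v1)               -- same reference: served from cache
local conf_v2 = {http_status = {disable_labels = {"node", "consumer"}}}
local map = get_cached_map(conf_v2)   -- new reference: rebuilt once
print(build_count)  -- 2
```

Three lookups trigger only two builds. The trade-off is that an in-place edit to the same table would go unnoticed, so a version counter may be safer if the config can be mutated.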

Comment on lines +841 to +842
local disabled_label_metric_map = get_disabled_label_metric_map()


Copilot AI Apr 21, 2026

Same hot-path allocation issue as in http_log: inc_llm_active_connections recomputes disabled_label_metric_map on each increment/decrement. If you implement caching for get_disabled_label_metric_map, it should be reused here as well to avoid extra per-request allocations.

Comment thread conf/config.yaml.example
Comment on lines +638 to +639
# disable_labels: # List of built-in label names to drop (reduces cardinality).
# - node # e.g. drop the upstream node IP label

Copilot AI Apr 21, 2026

The new disable_labels comment says "drop" labels, but the implementation/docs describe collapsing the label values to an empty string while keeping the label present (non-breaking). Wording here is misleading and could cause users to assume the label will be removed from the metric schema.

Suggested change
-# disable_labels: # List of built-in label names to drop (reduces cardinality).
-# - node # e.g. drop the upstream node IP label
+# disable_labels: # List of built-in label names whose values are set to empty strings (reduces cardinality while keeping labels).
+# - node # e.g. keep the upstream node IP label but set its value to an empty string

Labels: enhancement (New feature or request); size:L (This PR changes 100-499 lines, ignoring generated files)

Projects: None yet

Development: Successfully merging this pull request may close this issue: feat: Apisix Prometheus metrics - Disable unneeded labels/metrics

2 participants