The metrics emitted by the Azure Disk CSI Driver fall broadly into two categories: CSI and Azure Cloud operation latency metrics. The CSI metrics record the latency of the CSI calls made to the driver, e.g. `ControllerPublishVolume`. The Azure Cloud metrics record the latency of Azure Cloud operations performed as part of a driver operation, e.g. `attach_disk`. The individual operation metrics are recorded in two different [histogram](https://prometheus.io/docs/concepts/metric_types/#histogram) metrics, using the labels `request` and/or `source` to differentiate among the operations. The table below describes the values of the individual operation metrics.
The Azure Disk CSI Driver exposes comprehensive Prometheus metrics for monitoring driver operations, performance, and health. The metrics fall into several categories:
### CSI Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `azuredisk_csi_driver_operations_total` | Counter | `operation`, `success` | Total number of CSI operations (both controller and node) |
| `azuredisk_csi_driver_operation_duration_seconds` | Histogram | `operation`, `success` | Duration of CSI operations in seconds (basic metric without detailed labels) |
| `azuredisk_csi_driver_operation_duration_seconds_labeled` | Histogram | `operation`, `success`, `disk_sku`, `caching_mode`, `zone` | Duration of CSI operations in seconds with detailed labels for analysis |

**Operation Types (Controller):**

- `controller_create_volume` - Create a new disk volume
- `controller_delete_volume` - Delete a disk volume
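
The `success` label on `azuredisk_csi_driver_operations_total` makes it possible to derive a per-operation success ratio from a single scrape. The following is a minimal Python sketch of that calculation, not part of the driver. The sample values are made up, the helper name `success_ratio` is hypothetical, and the label values `"true"` and `"false"` are an assumption about how the driver encodes `success`.

```python
import re
from collections import defaultdict

# Hypothetical scrape output in the Prometheus text exposition format.
# The numbers are invented for illustration only.
SAMPLE = """\
azuredisk_csi_driver_operations_total{operation="controller_create_volume",success="true"} 18
azuredisk_csi_driver_operations_total{operation="controller_create_volume",success="false"} 2
azuredisk_csi_driver_operations_total{operation="controller_delete_volume",success="true"} 5
"""

LINE_RE = re.compile(
    r'^azuredisk_csi_driver_operations_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def success_ratio(text):
    """Return {operation: successful / total} for the operations counter."""
    counts = defaultdict(lambda: defaultdict(float))
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        operation = labels.get("operation", "")
        # Assumption: the success label is the string "true" or "false".
        outcome = labels.get("success", "")
        counts[operation][outcome] += float(match.group("value"))
    ratios = {}
    for operation, by_outcome in counts.items():
        total = sum(by_outcome.values())
        if total > 0:
            ratios[operation] = by_outcome["true"] / total
    return ratios


ratios = success_ratio(SAMPLE)
for operation, ratio in sorted(ratios.items()):
    print(f"{operation}: {ratio:.0%} successful")
```

Because counters only ever increase, a ratio computed this way covers the whole lifetime of the process; comparing two scrapes taken at different times would give the ratio for that interval instead.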
@@ -37,65 +65,135 @@ The first creates a `Service` object that exposes the default Azure Disk CSI Dri
## Direct scraping
To scrape metrics directly from the Azure Disk CSI Driver controller, first get the leader node for one of the CSI sidecars depending on the metrics you wish to observe:
The leader sidecar communicates with the Azure Disk CSI Driver on the same node to manage Azure Managed Disks. Once you determine which set of metrics you want to scrape, use the leader election lease name to find the current leader and set up a local port forwarder to the Azure Disk CSI Driver's metrics port. For example, run the following commands to set up port forwarding to the Azure Disk CSI Driver metrics server in the pod with the `external-attacher` leader:
To scrape metrics directly from the Azure Disk CSI Driver controller, set up port forwarding to access the metrics endpoint. For example:
LEADER_NODE=$(kubectl get lease -n kube-system "${LEADER_LEASE}" --output jsonpath='{.spec.holderIdentity}')
LEADER_POD=$(kubectl get pod -n kube-system -l app=csi-azuredisk-controller --output jsonpath="{.items[?(@.spec.nodeName==\"${LEADER_NODE}\")].metadata.name}")
This example gets the name of the node holding the CSI `external-attacher` leader election lease lock, finds the name of the Azure Disk CSI Driver pod containing the leader and sets up port forwarding to the metrics server on localhost port 29604. After waiting for the port forwarder to initialize and begin serving requests, you can then get the metrics.
This sets up port forwarding to the Azure Disk CSI Driver metrics server on localhost port 29604. After waiting for the port forwarder to initialize and begin serving requests, you can then get the metrics.
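
One way to handle the wait for the port forwarder is to poll the local port until it accepts TCP connections. The following is a minimal Python sketch under that assumption. `wait_for_port` is a hypothetical helper, not something shipped with the driver; only the port number 29604 comes from the text above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=29604, timeout=10.0, interval=0.2):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once a connection is accepted, or False if `timeout`
    seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connection only shows that the forwarder is listening locally; it does not prove that the metrics server behind it is healthy, so a failed first request may still need a retry.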
The format of the data returned by the Azure Disk CSI Driver metrics server is described in [Prometheus Exposition Formats](https://prometheus.io/docs/instrumenting/exposition_formats/).
### Example: Get `ControllerPublishVolume` and `ControllerUnpublishVolume` metrics
### Example: Query CSI Operations
Once you have set up port forwarding, you can use the following command to get the `ControllerPublishVolume` and `ControllerUnpublishVolume` metrics.
We can calculate the average latency of each operation by dividing its `*_sum` metric by its `*_count` metric. The `*_sum` value is in seconds. The following output shows an average `ControllerPublishVolume` latency of 9.5s and `ControllerUnpublishVolume` of 13.7s.
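
The `*_sum` divided by `*_count` calculation can be automated over a whole scrape. The sketch below is one possible approach in Python, assuming the text exposition format and assuming that the `_sum` and `_count` series of a histogram carry identical label strings. The helper name `average_latency` and all sample values are hypothetical; the metric name is borrowed from the CSI metrics table purely as an example.

```python
import re

# Hypothetical scrape output; the values are invented for illustration.
SAMPLE = """\
# HELP lines and other comments are ignored by the parser below.
azuredisk_csi_driver_operation_duration_seconds_sum{operation="controller_create_volume",success="true"} 45.0
azuredisk_csi_driver_operation_duration_seconds_count{operation="controller_create_volume",success="true"} 18
azuredisk_csi_driver_operation_duration_seconds_sum{operation="controller_delete_volume",success="true"} 10.0
azuredisk_csi_driver_operation_duration_seconds_count{operation="controller_delete_volume",success="true"} 5
"""

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)


def average_latency(text, base_name):
    """Return {label string: sum / count} in seconds for one histogram."""
    sums, counts = {}, {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # comments, blank lines, unlabeled samples
        name = match.group("name")
        labels = match.group("labels")
        value = float(match.group("value"))
        if name == base_name + "_sum":
            sums[labels] = value
        elif name == base_name + "_count":
            counts[labels] = value
    # Skip series with a zero or missing count to avoid dividing by zero.
    return {
        labels: sums[labels] / counts[labels]
        for labels in sums
        if counts.get(labels)
    }


averages = average_latency(
    SAMPLE, "azuredisk_csi_driver_operation_duration_seconds"
)
for labels, seconds in sorted(averages.items()):
    print(f"{labels}: {seconds:.1f}s")
```

Keying on the raw label string keeps the sketch short; a more robust version would parse the labels into a sorted tuple so that label order does not matter.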
Get detailed operation metrics, including the `disk_sku`, `caching_mode`, and `zone` labels:
To calculate the average `attach_disk` latency, we must sum the initiation and completion wait latencies. These are 3.6363432579999997 and 176.880914393, respectively, in the output below. The sum is 180.5172576509999997. We then divide by the count from either `attach_disk` metric since the two latencies represent one total operation. We see 14 `attach_disk` operations, so the average latency is 12.9s. The average `detach_disk` latency is 13.6s.
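
The `attach_disk` arithmetic above can be checked with a few lines of Python. The three input numbers are the ones quoted in the text; the variable names are only illustrative.

```python
# Values quoted in the text for the two phases of attach_disk.
initiation_sum = 3.6363432579999997   # seconds spent initiating attaches
completion_sum = 176.880914393        # seconds spent waiting for completion
operation_count = 14                  # count from either attach_disk metric

# Both phases belong to the same operation, so add the sums first,
# then divide once by the shared count.
total_seconds = initiation_sum + completion_sum
average_seconds = total_seconds / operation_count

print(f"average attach_disk latency: {average_seconds:.1f}s")  # prints 12.9s
```

Dividing each phase by the count separately and adding the two averages gives the same result, since both phases share the count of 14.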
### Example: Get Azure API Metrics for Disk Operations
Query Azure API latencies for disk create, delete, and resize operations: