Commit 2941d32

chore: add csi metrics
1 parent e1c2303 commit 2941d32

5 files changed

+671
-104
lines changed

deploy/example/metrics/README.md

Lines changed: 140 additions & 43 deletions
@@ -1,20 +1,47 @@
1-
# Get Prometheus metrics from CSI driver
1+
# Prometheus Metrics for CSI Disk Driver
22

33
## Metrics description
44

5-
The metrics emitted by the Azure Disk CSI Driver fall broadly into two categories: CSI and Azure Cloud operation latency metrics. The CSI metrics record the latency of the CSI calls made to the driver, e,g, `ControllerPublishVolume`. The Azure Cloud metrics record the latency of Azure Cloud operations perform as part driver operation, e.g. `attach_disk`. The individual operation metrics are recorded in two different [histogram](https://prometheus.io/docs/concepts/metric_types/#histogram) metrics using the labels `request` and/or `source` to differentiate among the operations. The table below describes the values of the individual operation metrics.
5+
The Azure Disk CSI Driver exposes comprehensive Prometheus metrics for monitoring driver operations, performance, and health. The metrics fall into several categories:
6+
7+
### CSI Metrics
8+
9+
| Metric Name | Type | Labels | Description |
10+
|-------------|------|--------|-------------|
11+
| `azuredisk_csi_driver_operations_total` | Counter | `operation`, `success` | Total number of CSI operations (both controller and node) |
12+
| `azuredisk_csi_driver_operation_duration_seconds` | Histogram | `operation`, `success` | Duration of CSI operations in seconds (basic metric without detailed labels) |
13+
| `azuredisk_csi_driver_operation_duration_seconds_labeled` | Histogram | `operation`, `success`, `disk_sku`, `caching_mode`, `zone` | Duration of CSI operations in seconds with detailed labels for analysis |
14+
15+
**Operation Types (Controller):**
16+
- `create_volume` - Create a new disk volume
17+
- `controller_delete_volume` - Delete a disk volume
18+
- `controller_modify_volume` - Modify volume properties
19+
- `controller_publish_volume` - Attach disk to node (VM)
20+
- `controller_unpublish_volume` - Detach disk from node (VM)
21+
- `controller_expand_volume` - Expand volume capacity
22+
- `controller_create_snapshot` - Create a snapshot
23+
- `controller_delete_snapshot` - Delete a snapshot
24+
25+
**Operation Types (Node):**
26+
- `node_stage_volume` - Stage volume to global mount path
27+
- `node_unstage_volume` - Unstage volume from global mount path
28+
- `node_publish_volume` - Mount volume to pod path
29+
- `node_unpublish_volume` - Unmount volume from pod path
30+
- `node_expand_volume` - Expand filesystem on node
31+
32+
**Label Values:**
33+
- `success`: `true` or `false`
34+
- `disk_sku`: Disk SKU type (e.g., `Premium_LRS`, `StandardSSD_LRS`, `Standard_LRS`)
35+
- `caching_mode`: Disk caching mode (e.g., `None`, `ReadOnly`, `ReadWrite`)
36+
- `zone`: Availability zone (e.g., `1`, `2`, `3`, or empty for non-zonal)
37+
38+
### Azure Cloud API Metrics
39+
40+
The driver also exposes Azure Cloud provider API metrics:
641

742
| Name | `request` | `source` | Description |
843
|------|-----------|----------|-------------|
9-
| `cloudprovider_azure_op_duration_seconds` | | | Records the CSI operation metrics |
10-
| | `azuredisk_csi_driver_controller_create_volume` | `disk.csi.azure.com` | `ControllerCreateVolume` latency |
11-
| | `azuredisk_csi_driver_controller_delete_volume` | `disk.csi.azure.com` | `ControllerDeleteVolume` latency |
12-
| | `azuredisk_csi_driver_controller_expand_volume` | `disk.csi.azure.com` | `ControllerExpandVolume` latency |
13-
| | `azuredisk_csi_driver_controller_create_snapshot` | `disk.csi.azure.com` | `ControllerCreateSnapshot` latency |
14-
| | `azuredisk_csi_driver_controller_delete_snapshot` | `disk.csi.azure.com` | `ControllerDeleteSnapshot` latency |
15-
| | `azuredisk_csi_driver_controller_publish_volume` | `disk.csi.azure.com` | `ControllerPublishVolume` latency |
16-
| | `azuredisk_csi_driver_controller_unpublish_volume` | `disk.csi.azure.com` | `ControllerUnpublishVolume` latency |
17-
| `cloudprovider_azure_api_request_duration_seconds` | | | Records the Azure Cloud operation metrics |
44+
| `cloudprovider_azure_api_request_duration_seconds` | | | Records the Azure Cloud API operation metrics |
1845
| | `disks_create_or_update` | | `create_disk` latency |
1946
| | `disks_delete` | | `delete_disk` latency |
2047
| | `disks_update` | | `resize_disk` latency |
@@ -37,65 +64,135 @@ The first creates a `Service` object that exposes the default Azure Disk CSI Dri
3764

3865
## Direct scraping
3966

40-
To scrape metrics directly from the Azure Disk CSI Driver controller, first get the leader node for one of the CSI sidecars depending on the metrics you wish to observe:
41-
42-
| Sidecar | Lease Lock Name | CSI Metrics | Azure Cloud Metrics |
43-
|---------|-----------------|-------------|---------------------|
44-
| `external-provisioner` | `disk-csi-azure-com` | `ControllerCreateVolume` & `ControllerDeleteVolume` | `create_disk` & `delete_disk` |
45-
| `external-attacher` | `external-attacher-leader-disk-csi-azure-com` | `ControllerPublishVolume` & `ControllerUnpublishVolume` | `attach_disk` & `detach_disk` |
46-
| `external-resizer` | `external-resizer-disk-csi-azure-com` | `ControllerExpandVolume` | `resize_disk` |
47-
| `external-snapshotter` | `external-snapshotter-leader-disk-csi-azure-com` | `ControllerCreateSnapshot` & `ControllerDeleteSnapshot` | `create_snapshot` & `delete_snapshot` |
48-
49-
The leader sidecar communicates with the Azure Disk CSI Driver on the same node to manage Azure Managed Disks. Once you determine which set of metrics you want to scrape, use the leader election lease name to find the current leader and set up a local port forwarder to the Azure Disk CSI Driver's metrics port. For example, run the following commands to set up port forwarding to the Azure Disk CSI Driver metrics server in the pod with the `external-attacher` leader:
67+
To scrape metrics directly from the Azure Disk CSI Driver controller, set up port forwarding to access the metrics endpoint. For example:
5068

5169
```console
52-
LEADER_LEASE=external-attacher-leader-disk-csi-azure-com
53-
LEADER_NODE=$(kubectl get lease -n kube-system "${LEADER_LEASE}" --output jsonpath='{.spec.holderIdentity}')
54-
LEADER_POD=$(kubectl get pod -n kube-system -l app=csi-azuredisk-controller --output jsonpath="{.items[?(@.spec.nodeName==\"${LEADER_NODE}\")].metadata.name}")
55-
kubectl port-forward -n kube-system "pods/${LEADER_POD}" 29604:29604 &
70+
CONTROLLER_POD=$(kubectl get pod -n kube-system -l app=csi-azuredisk-controller --output jsonpath='{.items[0].metadata.name}')
71+
kubectl port-forward -n kube-system "pods/${CONTROLLER_POD}" 29604:29604 &
5672
PORTFORWARDER=$!
5773
```
5874

59-
This example gets the name of the node holding the CSI `external-attacher` leader election lease lock, finds the name of the Azure Disk CSI Driver pod containing the leader and sets up port forwarding to the metrics server on localhost port 29604. After waiting for the port forwarder to initialize and begin serving requests, you can then get the metrics.
75+
This sets up port forwarding to the Azure Disk CSI Driver metrics server on localhost port 29604. Once the port forwarder has initialized and is serving requests, you can query the metrics.
6076

6177
The format of the data returned by the Azure Disk CSI Driver metrics server is described in [Prometheus Exposition Formats](https://prometheus.io/docs/instrumenting/exposition_formats/).
6278

63-
### Example: Get `ControllerPublishVolume` and `ControllerUnpublishVolume` metrics
79+
### Example: Query CSI Operations
6480

65-
Once you have set up port forwarding, you can use the following command to get the `ControllerPublishVolume` and `ControlUnpublishVolume` metrics.
81+
Get total operations count by type and result:
6682

6783
```console
68-
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_op_duration_seconds_(sum|count)" | grep -E "request=\"azuredisk_csi_driver_controller_(un)?publish_volume\""
84+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operations_total"
85+
```
86+
87+
Sample output:
88+
```
89+
azuredisk_csi_driver_operations_total{operation="create_volume",success="true"} 15
90+
azuredisk_csi_driver_operations_total{operation="controller_delete_volume",success="true"} 8
91+
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="true"} 20
92+
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="false"} 2
93+
azuredisk_csi_driver_operations_total{operation="node_stage_volume",success="true"} 18
94+
azuredisk_csi_driver_operations_total{operation="node_publish_volume",success="true"} 25
6995
```
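
The counters above can be turned into a per-operation success percentage. The following is a minimal sketch, not part of the driver: `success_rate_by_operation` is a hypothetical helper name, and it assumes the exposition lines look like the sample output, with one `operation` and one `success` label per series.

```shell
# Hypothetical helper (not shipped with the driver): reads Prometheus text
# exposition lines on stdin and prints "<operation> <success %>" per operation.
success_rate_by_operation() {
  awk '
    # Only consider samples of the operations counter.
    index($0, "azuredisk_csi_driver_operations_total{") != 1 { next }
    {
      # Extract the value of the operation label from the label set.
      if (!match($0, /operation="[^"]*"/)) next
      op = substr($0, RSTART + 11, RLENGTH - 12)
      total[op] += $NF                      # the last field is the sample value
      if (index($0, "success=\"true\"") > 0) ok[op] += $NF
    }
    END {
      for (op in total)
        if (total[op] > 0) printf "%s %.1f\n", op, 100 * ok[op] / total[op]
    }
  ' | sort
}
```

With port forwarding active, `curl -s http://localhost:29604/metrics | success_rate_by_operation` prints one line per operation. For the sample output above, `controller_publish_volume` comes out at 90.9 (20 successes out of 22 calls).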
7096

71-
We can calculate the average latency of each operation by dividing its `*_sum` by `*_count` metric. The `*_sum` value is in seconds. The following output shows an average `ControllerPublishVolume` latency of 9.5s and `ControllerUnpublishVolume` of 13.7s.
7297

98+
### Example: Query Operation Duration with Labels
99+
100+
Get operation duration metrics broken down by the `disk_sku`, `caching_mode`, and `zone` labels:
101+
102+
```console
103+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operation_duration_seconds_labeled"
73104
```
74-
cloudprovider_azure_op_duration_seconds_sum{request="azuredisk_csi_driver_controller_publish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 181.10639633399998
75-
cloudprovider_azure_op_duration_seconds_count{request="azuredisk_csi_driver_controller_publish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 19
76-
cloudprovider_azure_op_duration_seconds_sum{request="azuredisk_csi_driver_controller_unpublish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 232.39884008299998
77-
cloudprovider_azure_op_duration_seconds_count{request="azuredisk_csi_driver_controller_unpublish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 17
105+
106+
Sample output showing histogram buckets:
107+
```
108+
azuredisk_csi_driver_operation_duration_seconds_labeled_bucket{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="create_volume",success="true",zone="1",le="1"} 5
109+
azuredisk_csi_driver_operation_duration_seconds_labeled_bucket{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="create_volume",success="true",zone="1",le="5"} 10
110+
azuredisk_csi_driver_operation_duration_seconds_labeled_sum{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="create_volume",success="true",zone="1"} 12.5
111+
azuredisk_csi_driver_operation_duration_seconds_labeled_count{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="create_volume",success="true",zone="1"} 10
78112
```
79113

80-
### Example: Get `attach_disk` and `detach_disk` metrics
114+
### Example: Monitor Operation Duration
115+
116+
Query basic operation duration metrics:
81117

82118
```console
83-
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "source=\"(attach_disk|detach_disk)\""
119+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operation_duration_seconds" | grep -v "labeled"
120+
```
121+
122+
Sample output showing histogram buckets for operations:
84123
```
124+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="create_volume",success="true",le="1"} 8
125+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="create_volume",success="true",le="5"} 15
126+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="create_volume",success="true",le="10"} 15
127+
azuredisk_csi_driver_operation_duration_seconds_sum{operation="create_volume",success="true"} 45.234
128+
azuredisk_csi_driver_operation_duration_seconds_count{operation="create_volume",success="true"} 15
129+
```
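
The average duration of an operation is its `_sum` divided by its `_count`. As a rough sketch under the assumption that the output matches the sample above, the hypothetical helper `avg_duration_by_operation` below computes that average per `operation`, combining successful and failed calls and ignoring the `_labeled` variant.

```shell
# Hypothetical helper (not shipped with the driver): reads Prometheus text
# exposition lines on stdin and prints "<operation> <average seconds>".
avg_duration_by_operation() {
  awk '
    {
      # Keep only the _sum and _count series of the basic histogram; the
      # prefix match deliberately excludes the *_labeled_* series.
      if (index($0, "azuredisk_csi_driver_operation_duration_seconds_sum{") == 1) kind = "sum"
      else if (index($0, "azuredisk_csi_driver_operation_duration_seconds_count{") == 1) kind = "count"
      else next

      # Extract the value of the operation label from the label set.
      if (!match($0, /operation="[^"]*"/)) next
      op = substr($0, RSTART + 11, RLENGTH - 12)

      if (kind == "sum") sum[op] += $NF; else count[op] += $NF
    }
    END {
      for (op in count)
        if (count[op] > 0) printf "%s %.3f\n", op, sum[op] / count[op]
    }
  ' | sort
}
```

For the sample output above, piping the metrics through `avg_duration_by_operation` yields `create_volume 3.016`, that is, 45.234 seconds spread over 15 calls.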
130+
131+
### Example: Get Azure API Metrics for Disk Operations
85132

86-
To calculate the average `attach_disk` latency, we must sum the initiation and completion wait latencies. These are 3.6363432579999997 and 176.880914393, respectively, in the output below. The sum is 180.5172576509999997. We then divide by the count from either `attach_disk` metric since the two latencies represent one total operation. We see 14 `attach_disk` operations, so the average latency is 12.9s. The average `detach_disk` latency is 13.6s.
133+
Query Azure API latencies for disk create, delete, and resize operations:
87134

88135
```console
89-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmss_wait_for_update_result",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 176.880914393
90-
cloudprovider_azure_api_request_duration_seconds_count{request="vmss_wait_for_update_result",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 14
91-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_update",resource_group="edreed-k8s-failover-rg",source="detach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 231.78358075499997
92-
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_update",resource_group="edreed-k8s-failover-rg",source="detach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 17
93-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_updateasync",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 3.6363432579999997
94-
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_updateasync",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 14
136+
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "request=\"(disks_create_or_update|disks_delete|disks_update)\""
137+
```
138+
139+
Sample output:
140+
```
141+
cloudprovider_azure_api_request_duration_seconds_sum{request="disks_create_or_update",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 45.234
142+
cloudprovider_azure_api_request_duration_seconds_count{request="disks_create_or_update",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 15
143+
cloudprovider_azure_api_request_duration_seconds_sum{request="disks_delete",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 23.567
144+
cloudprovider_azure_api_request_duration_seconds_count{request="disks_delete",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 8
145+
```
146+
147+
### Example: Get `attach_disk` and `detach_disk` Azure API Metrics
148+
149+
Query Azure API latencies for VM attach/detach operations:
150+
151+
```console
152+
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "source=\"(attach_disk|detach_disk)\""
153+
```
154+
155+
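To calculate the average `attach_disk` latency, sum the initiation (`vmssvm_updateasync`) and completion wait (`vmss_wait_for_update_result`) latencies, then divide by the count of either metric, since the two together represent one operation. In the sample output below, this gives (3.636 + 176.881) / 14 ≈ 12.9 seconds: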
156+
157+
```
158+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmss_wait_for_update_result",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 176.880914393
159+
cloudprovider_azure_api_request_duration_seconds_count{request="vmss_wait_for_update_result",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 14
160+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_update",resource_group="mc_myaks_myaks_eastus",source="detach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 231.78358075499997
161+
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_update",resource_group="mc_myaks_myaks_eastus",source="detach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 17
162+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_updateasync",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 3.6363432579999997
163+
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_updateasync",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 14
95164
```
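
The same calculation can be scripted. This is a minimal sketch rather than driver tooling: `avg_latency_by_source` is a hypothetical helper, and it assumes that every API request recorded under one `source` reports the same `_count`, as in the sample output.

```shell
# Hypothetical helper (not shipped with the driver): reads Prometheus text
# exposition lines on stdin and prints the average latency in seconds for the
# given `source` label value (for example attach_disk).
avg_latency_by_source() {
  awk -v src="source=\"$1\"" '
    index($0, src) == 0 { next }            # skip samples from other sources
    index($0, "_sum{") > 0 { sum += $NF }   # add up the latency of every phase
    index($0, "_count{") > 0 {
      # All phases belong to the same operations, so count them only once.
      if ($NF + 0 > count) count = $NF + 0
    }
    END { if (count > 0) printf "%.1f\n", sum / count }
  '
}
```

With port forwarding active, `curl -s http://localhost:29604/metrics | avg_latency_by_source attach_disk` prints `12.9` for the sample output above, and `avg_latency_by_source detach_disk` prints `13.6`.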
96165

97166
Stop port forwarding with the following command:
98167

99168
```console
100169
kill -9 $PORTFORWARDER
101170
```
171+
172+
## Grafana Dashboard
173+
174+
You can create a Grafana dashboard to visualize these metrics. Here are some useful PromQL queries:
175+
176+
### Operation Success Rate
177+
```promql
178+
sum(rate(azuredisk_csi_driver_operations_total{success="true"}[5m])) by (operation) /
179+
sum(rate(azuredisk_csi_driver_operations_total[5m])) by (operation) * 100
180+
```
181+
182+
### Average Operation Duration
183+
```promql
184+
rate(azuredisk_csi_driver_operation_duration_seconds_sum[5m]) /
185+
rate(azuredisk_csi_driver_operation_duration_seconds_count[5m])
186+
```
187+
188+
### Average Duration by Disk SKU
189+
```promql
190+
sum by (disk_sku) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_sum[5m])) /
191+
sum by (disk_sku) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_count[5m]))
192+
```
193+
194+
### Operation Rate by Zone
195+
```promql
196+
sum(rate(azuredisk_csi_driver_operation_duration_seconds_labeled_count[5m])) by (zone)
197+
```
198+
