Commit 2cfb9ff

committed
chore: add csi specific metrics
1 parent e1c2303 commit 2cfb9ff

5 files changed: +730 -90 lines changed

deploy/example/metrics/README.md

Lines changed: 141 additions & 43 deletions
@@ -1,20 +1,48 @@
1-
# Get Prometheus metrics from CSI driver
1+
2+
# Prometheus Metrics for CSI Disk Driver
23

34
## Metrics description
45

5-
The metrics emitted by the Azure Disk CSI Driver fall broadly into two categories: CSI and Azure Cloud operation latency metrics. The CSI metrics record the latency of the CSI calls made to the driver, e.g., `ControllerPublishVolume`. The Azure Cloud metrics record the latency of Azure Cloud operations performed as part of driver operations, e.g., `attach_disk`. The individual operation metrics are recorded in two different [histogram](https://prometheus.io/docs/concepts/metric_types/#histogram) metrics using the labels `request` and/or `source` to differentiate among the operations. The table below describes the values of the individual operation metrics.
6+
The Azure Disk CSI Driver exposes comprehensive Prometheus metrics for monitoring driver operations, performance, and health. The metrics fall into several categories:
7+
8+
### CSI Metrics
9+
10+
| Metric Name | Type | Labels | Description |
11+
|-------------|------|--------|-------------|
12+
| `azuredisk_csi_driver_operations_total` | Counter | `operation`, `success` | Total number of CSI operations (both controller and node) |
13+
| `azuredisk_csi_driver_operation_duration_seconds` | Histogram | `operation`, `success` | Duration of CSI operations in seconds (basic metric without detailed labels) |
14+
| `azuredisk_csi_driver_operation_duration_seconds_labeled` | Histogram | `operation`, `success`, `disk_sku`, `caching_mode`, `zone` | Duration of CSI operations in seconds with detailed labels for analysis |
15+
16+
**Operation Types (Controller):**
17+
- `controller_create_volume` - Create a new disk volume
18+
- `controller_delete_volume` - Delete a disk volume
19+
- `controller_modify_volume` - Modify volume properties
20+
- `controller_publish_volume` - Attach disk to node (VM)
21+
- `controller_unpublish_volume` - Detach disk from node (VM)
22+
- `controller_expand_volume` - Expand volume capacity
23+
- `controller_create_snapshot` - Create a snapshot
24+
- `controller_delete_snapshot` - Delete a snapshot
25+
26+
**Operation Types (Node):**
27+
- `node_stage_volume` - Stage volume to global mount path
28+
- `node_unstage_volume` - Unstage volume from global mount path
29+
- `node_publish_volume` - Mount volume to pod path
30+
- `node_unpublish_volume` - Unmount volume from pod path
31+
- `node_expand_volume` - Expand filesystem on node
32+
33+
**Label Values:**
34+
- `success`: `true` or `false`
35+
- `disk_sku`: Disk SKU type (e.g., `Premium_LRS`, `StandardSSD_LRS`, `Standard_LRS`)
36+
- `caching_mode`: Disk caching mode (e.g., `None`, `ReadOnly`, `ReadWrite`)
37+
- `zone`: Availability zone (e.g., `1`, `2`, `3`, or empty for non-zonal)
38+
39+
### Azure Cloud API Metrics
40+
41+
The driver also exposes Azure Cloud provider API metrics:
642

743
| Name | `request` | `source` | Description |
844
|------|-----------|----------|-------------|
9-
| `cloudprovider_azure_op_duration_seconds` | | | Records the CSI operation metrics |
10-
| | `azuredisk_csi_driver_controller_create_volume` | `disk.csi.azure.com` | `ControllerCreateVolume` latency |
11-
| | `azuredisk_csi_driver_controller_delete_volume` | `disk.csi.azure.com` | `ControllerDeleteVolume` latency |
12-
| | `azuredisk_csi_driver_controller_expand_volume` | `disk.csi.azure.com` | `ControllerExpandVolume` latency |
13-
| | `azuredisk_csi_driver_controller_create_snapshot` | `disk.csi.azure.com` | `ControllerCreateSnapshot` latency |
14-
| | `azuredisk_csi_driver_controller_delete_snapshot` | `disk.csi.azure.com` | `ControllerDeleteSnapshot` latency |
15-
| | `azuredisk_csi_driver_controller_publish_volume` | `disk.csi.azure.com` | `ControllerPublishVolume` latency |
16-
| | `azuredisk_csi_driver_controller_unpublish_volume` | `disk.csi.azure.com` | `ControllerUnpublishVolume` latency |
17-
| `cloudprovider_azure_api_request_duration_seconds` | | | Records the Azure Cloud operation metrics |
45+
| `cloudprovider_azure_api_request_duration_seconds` | | | Records the Azure Cloud API operation metrics |
1846
| | `disks_create_or_update` | | `create_disk` latency |
1947
| | `disks_delete` | | `delete_disk` latency |
2048
| | `disks_update` | | `resize_disk` latency |
@@ -37,65 +65,135 @@ The first creates a `Service` object that exposes the default Azure Disk CSI Dri
3765

3866
## Direct scraping
3967

40-
To scrape metrics directly from the Azure Disk CSI Driver controller, first get the leader node for one of the CSI sidecars depending on the metrics you wish to observe:
41-
42-
| Sidecar | Lease Lock Name | CSI Metrics | Azure Cloud Metrics |
43-
|---------|-----------------|-------------|---------------------|
44-
| `external-provisioner` | `disk-csi-azure-com` | `ControllerCreateVolume` & `ControllerDeleteVolume` | `create_disk` & `delete_disk` |
45-
| `external-attacher` | `external-attacher-leader-disk-csi-azure-com` | `ControllerPublishVolume` & `ControllerUnpublishVolume` | `attach_disk` & `detach_disk` |
46-
| `external-resizer` | `external-resizer-disk-csi-azure-com` | `ControllerExpandVolume` | `resize_disk` |
47-
| `external-snapshotter` | `external-snapshotter-leader-disk-csi-azure-com` | `ControllerCreateSnapshot` & `ControllerDeleteSnapshot` | `create_snapshot` & `delete_snapshot` |
48-
49-
The leader sidecar communicates with the Azure Disk CSI Driver on the same node to manage Azure Managed Disks. Once you determine which set of metrics you want to scrape, use the leader election lease name to find the current leader and set up a local port forwarder to the Azure Disk CSI Driver's metrics port. For example, run the following commands to set up port forwarding to the Azure Disk CSI Driver metrics server in the pod with the `external-attacher` leader:
68+
To scrape metrics directly from the Azure Disk CSI Driver controller, set up port forwarding to access the metrics endpoint. For example:
5069

5170
```console
52-
LEADER_LEASE=external-attacher-leader-disk-csi-azure-com
53-
LEADER_NODE=$(kubectl get lease -n kube-system "${LEADER_LEASE}" --output jsonpath='{.spec.holderIdentity}')
54-
LEADER_POD=$(kubectl get pod -n kube-system -l app=csi-azuredisk-controller --output jsonpath="{.items[?(@.spec.nodeName==\"${LEADER_NODE}\")].metadata.name}")
55-
kubectl port-forward -n kube-system "pods/${LEADER_POD}" 29604:29604 &
71+
CONTROLLER_POD=$(kubectl get pod -n kube-system -l app=csi-azuredisk-controller --output jsonpath='{.items[0].metadata.name}')
72+
kubectl port-forward -n kube-system "pods/${CONTROLLER_POD}" 29604:29604 &
5673
PORTFORWARDER=$!
5774
```
5875

59-
This example gets the name of the node holding the CSI `external-attacher` leader election lease lock, finds the name of the Azure Disk CSI Driver pod containing the leader and sets up port forwarding to the metrics server on localhost port 29604. After waiting for the port forwarder to initialize and begin serving requests, you can then get the metrics.
76+
This sets up port forwarding to the Azure Disk CSI Driver metrics server on localhost port 29604. After waiting for the port forwarder to initialize and begin serving requests, you can then get the metrics.
6077

6178
The format of the data returned by the Azure Disk CSI Driver metrics server is described in [Prometheus Exposition Formats](https://prometheus.io/docs/instrumenting/exposition_formats/).
6279

63-
### Example: Get `ControllerPublishVolume` and `ControllerUnpublishVolume` metrics
80+
### Example: Query CSI Operations
6481

65-
Once you have set up port forwarding, you can use the following command to get the `ControllerPublishVolume` and `ControllerUnpublishVolume` metrics.
82+
Get total operations count by type and result:
6683

6784
```console
68-
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_op_duration_seconds_(sum|count)" | grep -E "request=\"azuredisk_csi_driver_controller_(un)?publish_volume\""
85+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operations_total"
86+
```
87+
88+
Sample output:
6989
```
90+
azuredisk_csi_driver_operations_total{operation="controller_create_volume",success="true"} 15
91+
azuredisk_csi_driver_operations_total{operation="controller_delete_volume",success="true"} 8
92+
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="true"} 20
93+
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="false"} 2
94+
azuredisk_csi_driver_operations_total{operation="node_stage_volume",success="true"} 18
95+
azuredisk_csi_driver_operations_total{operation="node_publish_volume",success="true"} 25
96+
```
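
The counter samples above can be turned into a per-operation success percentage without a Prometheus server. The following is a minimal sketch, assuming `awk` and `sort` are available; `success_rate` is a hypothetical helper name, not something the driver ships.

```shell
# Hypothetical helper (not part of the driver): reads Prometheus text
# exposition on stdin and prints "<operation> <success %>" per operation.
success_rate() {
  awk '
    /^azuredisk_csi_driver_operations_total[{]/ {
      # Extract the operation and success label values from the label set.
      op = $0; sub(/.*operation="/, "", op); sub(/".*/, "", op)
      ok = $0; sub(/.*success="/, "", ok); sub(/".*/, "", ok)
      total[op] += $NF                  # the sample value is the last field
      if (ok == "true") good[op] += $NF
    }
    END {
      for (op in total)
        if (total[op] > 0)
          printf "%s %.1f\n", op, 100 * good[op] / total[op]
    }
  ' | sort
}

# Two series taken from the sample output above.
success_rate <<'EOF'
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="true"} 20
azuredisk_csi_driver_operations_total{operation="controller_publish_volume",success="false"} 2
EOF
# prints: controller_publish_volume 90.9
```

Against a live endpoint, the same helper could be fed with `curl -s http://localhost:29604/metrics | success_rate`.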
97+
98+
99+
### Example: Query Operation Duration with Labels
70100

71-
We can calculate the average latency of each operation by dividing its `*_sum` by `*_count` metric. The `*_sum` value is in seconds. The following output shows an average `ControllerPublishVolume` latency of 9.5s and `ControllerUnpublishVolume` of 13.7s.
101+
Get operation metrics broken down by the `disk_sku`, `caching_mode`, and `zone` labels:
102+
103+
```console
104+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operation_duration_seconds_labeled"
105+
```
72106

107+
Sample output showing histogram buckets:
73108
```
74-
cloudprovider_azure_op_duration_seconds_sum{request="azuredisk_csi_driver_controller_publish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 181.10639633399998
75-
cloudprovider_azure_op_duration_seconds_count{request="azuredisk_csi_driver_controller_publish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 19
76-
cloudprovider_azure_op_duration_seconds_sum{request="azuredisk_csi_driver_controller_unpublish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 232.39884008299998
77-
cloudprovider_azure_op_duration_seconds_count{request="azuredisk_csi_driver_controller_unpublish_volume",resource_group="edreed-k8s-failover-rg",source="disk.csi.azure.com",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 17
109+
azuredisk_csi_driver_operation_duration_seconds_labeled_bucket{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="controller_create_volume",success="true",zone="1",le="1"} 5
110+
azuredisk_csi_driver_operation_duration_seconds_labeled_bucket{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="controller_create_volume",success="true",zone="1",le="5"} 10
111+
azuredisk_csi_driver_operation_duration_seconds_labeled_sum{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="controller_create_volume",success="true",zone="1"} 12.5
112+
azuredisk_csi_driver_operation_duration_seconds_labeled_count{caching_mode="ReadOnly",disk_sku="Premium_LRS",operation="controller_create_volume",success="true",zone="1"} 10
78113
```
79114

80-
### Example: Get `attach_disk` and `detach_disk` metrics
115+
### Example: Monitor Operation Duration
116+
117+
Query basic operation duration metrics:
81118

82119
```console
83-
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "source=\"(attach_disk|detach_disk)\""
120+
curl http://localhost:29604/metrics | grep "azuredisk_csi_driver_operation_duration_seconds" | grep -v "labeled"
121+
```
122+
123+
Sample output showing histogram buckets for operations:
124+
```
125+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="1"} 8
126+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="5"} 15
127+
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="10"} 15
128+
azuredisk_csi_driver_operation_duration_seconds_sum{operation="controller_create_volume",success="true"} 45.234
129+
azuredisk_csi_driver_operation_duration_seconds_count{operation="controller_create_volume",success="true"} 15
84130
```
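
Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound. The sketch below converts the sample buckets above into per-interval counts. It assumes the input holds a single series whose buckets arrive in ascending `le` order, and `bucket_deltas` is a hypothetical helper name.

```shell
# Hypothetical helper: converts cumulative histogram buckets read on stdin
# into per-interval counts. Assumes a single series whose buckets arrive in
# ascending "le" order.
bucket_deltas() {
  awk '
    /_bucket[{]/ {
      le = $0; sub(/.*le="/, "", le); sub(/".*/, "", le)
      printf "le<=%s %d\n", le, $NF - prev   # observations new to this bucket
      prev = $NF
    }
  '
}

# Buckets taken from the sample output above.
bucket_deltas <<'EOF'
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="1"} 8
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="5"} 15
azuredisk_csi_driver_operation_duration_seconds_bucket{operation="controller_create_volume",success="true",le="10"} 15
EOF
# prints:
#   le<=1 8
#   le<=5 7
#   le<=10 0
```

Read this way, 8 of the 15 sampled `controller_create_volume` calls finished within 1s and the remaining 7 took between 1s and 5s.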
85131

86-
To calculate the average `attach_disk` latency, we must sum the initiation and completion wait latencies. These are 3.6363432579999997 and 176.880914393, respectively, in the output below. The sum is 180.5172576509999997. We then divide by the count from either `attach_disk` metric since the two latencies represent one total operation. We see 14 `attach_disk` operations, so the average latency is 12.9s. The average `detach_disk` latency is 13.6s.
132+
### Example: Get Azure API Metrics for Disk Operations
133+
134+
Query Azure API latencies for disk create, delete, and resize operations:
87135

88136
```console
89-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmss_wait_for_update_result",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 176.880914393
90-
cloudprovider_azure_api_request_duration_seconds_count{request="vmss_wait_for_update_result",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 14
91-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_update",resource_group="edreed-k8s-failover-rg",source="detach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 231.78358075499997
92-
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_update",resource_group="edreed-k8s-failover-rg",source="detach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 17
93-
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_updateasync",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 3.6363432579999997
94-
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_updateasync",resource_group="edreed-k8s-failover-rg",source="attach_disk",subscription_id="d64ddb0c-7399-4529-a2b6-037b33265372"} 14
137+
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "request=\"(disks_create_or_update|disks_delete|disks_update)\""
138+
```
139+
140+
Sample output:
141+
```
142+
cloudprovider_azure_api_request_duration_seconds_sum{request="disks_create_or_update",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 45.234
143+
cloudprovider_azure_api_request_duration_seconds_count{request="disks_create_or_update",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 15
144+
cloudprovider_azure_api_request_duration_seconds_sum{request="disks_delete",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 23.567
145+
cloudprovider_azure_api_request_duration_seconds_count{request="disks_delete",resource_group="mc_myaks_myaks_eastus",subscription_id="12345678-1234-1234-1234-123456789012"} 8
146+
```
147+
148+
### Example: Get `attach_disk` and `detach_disk` Azure API Metrics
149+
150+
Query Azure API latencies for VM attach/detach operations:
151+
152+
```console
153+
curl http://localhost:29604/metrics | grep -E "cloudprovider_azure_api_request_duration_seconds_(sum|count)" | grep -E "source=\"(attach_disk|detach_disk)\""
154+
```
155+
156+
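To calculate the average `attach_disk` latency, add the `_sum` values of the initiation request (`vmssvm_updateasync`) and the completion wait request (`vmss_wait_for_update_result`), then divide by the `_count` of either one, since the two latencies make up a single operation. In the sample output below, that is (3.636 + 176.881) / 14 ≈ 12.9s for `attach_disk`; `detach_disk` averages 231.784 / 17 ≈ 13.6s: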
157+
158+
```
159+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmss_wait_for_update_result",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 176.880914393
160+
cloudprovider_azure_api_request_duration_seconds_count{request="vmss_wait_for_update_result",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 14
161+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_update",resource_group="mc_myaks_myaks_eastus",source="detach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 231.78358075499997
162+
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_update",resource_group="mc_myaks_myaks_eastus",source="detach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 17
163+
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_updateasync",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 3.6363432579999997
164+
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_updateasync",resource_group="mc_myaks_myaks_eastus",source="attach_disk",subscription_id="12345678-1234-1234-1234-123456789012"} 14
95165
```
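
The averaging described above can be scripted. This is a rough sketch, assuming `awk` is available and that every request recorded for a given `source` has the same `_count`; `avg_latency_by_source` is a hypothetical helper name, and the sample lines are trimmed to the labels that matter.

```shell
# Hypothetical helper: average latency in seconds for one `source` value.
# Adds every matching *_sum and divides by the last matching *_count,
# on the assumption that all requests for the source share the same count.
avg_latency_by_source() {
  awk -v src="$1" '
    index($0, "source=\"" src "\"") == 0 { next }   # skip other sources
    /_sum[{]/   { total += $NF }
    /_count[{]/ { count = $NF }
    END { if (count > 0) printf "%.1f\n", total / count }
  '
}

# Values from the sample output above, with unrelated labels removed.
avg_latency_by_source attach_disk <<'EOF'
cloudprovider_azure_api_request_duration_seconds_sum{request="vmss_wait_for_update_result",source="attach_disk"} 176.880914393
cloudprovider_azure_api_request_duration_seconds_count{request="vmss_wait_for_update_result",source="attach_disk"} 14
cloudprovider_azure_api_request_duration_seconds_sum{request="vmssvm_updateasync",source="attach_disk"} 3.6363432579999997
cloudprovider_azure_api_request_duration_seconds_count{request="vmssvm_updateasync",source="attach_disk"} 14
EOF
# prints: 12.9
```

Passing `detach_disk` instead selects the `vmssvm_update` series from the same output.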
96166

97167
Stop port forwarding with the following command:
98168

99169
```console
100170
kill -9 $PORTFORWARDER
101171
```
172+
173+
## Grafana Dashboard
174+
175+
You can create a Grafana dashboard to visualize these metrics. Here are some useful PromQL queries:
176+
177+
### Operation Success Rate
178+
```promql
179+
sum(rate(azuredisk_csi_driver_operations_total{success="true"}[5m])) by (operation) /
180+
sum(rate(azuredisk_csi_driver_operations_total[5m])) by (operation) * 100
181+
```
182+
183+
### Average Operation Duration
184+
```promql
185+
rate(azuredisk_csi_driver_operation_duration_seconds_sum[5m]) /
186+
rate(azuredisk_csi_driver_operation_duration_seconds_count[5m])
187+
```
188+
189+
### Average Duration by Disk SKU
190+
```promql
191+
sum by (disk_sku) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_sum[5m])) /
192+
sum by (disk_sku) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_count[5m]))
193+
```
194+
195+
### Average Duration by Zone
196+
```promql
197+
sum by (zone) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_sum[5m])) /
198+
sum by (zone) (rate(azuredisk_csi_driver_operation_duration_seconds_labeled_count[5m]))
199+
```
