disk-smart: rewrite with JSON parsing, NVMe support, and split into specialized checks #1050

@markuslf

Description

The current disk-smart plugin has several limitations:

  • Text-only parsing of smartctl output, fragile and dependent on exact formatting
  • No NVMe support
  • No JSON support (available since smartctl 7.3, released 2022)
  • Hardcoded attribute checks via if/elif chains instead of a data-driven attribute database
  • ~900 lines in a single file with high cyclomatic complexity
  • When the error log triggers CRIT (which can happen quite often), additional SMART attribute issues go unnoticed because the check is already in CRIT status
  • Missing values in output, e.g. empty model family: * sda (, Samsung SSD 883 DCT 1.92TB, SerNo 12345678) (see #291, "disk-smart: Missing value in output")
  • Unhelpful error messages on smartctl failures, no option to show smartctl output for debugging (see #671, "disk-smart: Show smartctl output on failure")

Proposed changes

Split into specialized checks, e.g.:

  • disk-smart-attributes
  • disk-smart-error-log
  • disk-smart-health
  • disk-smart-self-tests
  • disk-smart-stats
  • disk-smart-temperature

This allows administrators to acknowledge or silence individual aspects (e.g. a noisy error log) without ignoring the overall disk health.

Provide an Icinga Director service set for the new split checks, including a Journald Query and a Systemd Unit check for smartd.service (#604).

Shared smartctl cache via lib/cache: Each plugin checks if cached smartctl data is older than 8 hours. If so, it calls smartctl --xall --json and updates the cache. Otherwise it reads from cache. No separate collector plugin needed.
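
The read-or-refresh logic could look roughly like the sketch below. It is only an illustration: `get_smart_data`, the dict-like `cache` and the `run_smartctl` callable are hypothetical stand-ins, not the actual lib/cache API, and the real smartctl call would also need the device path as an argument.

```python
import json
import time

MAX_AGE_SECONDS = 8 * 60 * 60  # the 8-hour limit from the proposal


def get_smart_data(device, cache, run_smartctl, now=None):
    """Return parsed smartctl data for `device`, refreshing a stale cache.

    `cache` is any dict-like store mapping device -> (timestamp, json_text).
    `run_smartctl` is a callable that returns raw JSON text for a device.
    Both are assumptions made for this sketch, not the lib/cache interface.
    """
    now = time.time() if now is None else now
    entry = cache.get(device)
    if entry is not None:
        timestamp, payload = entry
        if now - timestamp <= MAX_AGE_SECONDS:
            # Cache is fresh enough: no smartctl call needed.
            return json.loads(payload)
    # Cache is missing or older than 8 hours: query the device again.
    payload = run_smartctl(device)
    data = json.loads(payload)  # validate before storing
    cache[device] = (now, payload)
    return data
```

Because every plugin goes through the same function, whichever check runs first after the cache expires pays for the smartctl call and the others reuse its result.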

Shared helper library lib/disk_smart.py: Common functions across all disk-smart plugins (cache handling, smartctl invocation, JSON/text parsing, attribute database) are implemented in a shared library to avoid code duplication.

JSON parsing (smartctl 7.3+) with text fallback for older versions.
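
One possible way to choose between the two modes is to inspect the first line of `smartctl --version`. The sketch below assumes that line starts with `smartctl <major>.<minor>`; the helper names `supports_json` and `build_smartctl_args` are hypothetical and not part of the existing plugin.

```python
import re

MIN_JSON_VERSION = (7, 3)  # minimum version named in this proposal


def supports_json(version_output):
    """Return True if the reported smartctl version is 7.3 or newer."""
    match = re.search(r"smartctl (\d+)\.(\d+)", version_output)
    if match is None:
        return False  # unknown format: stay on the text parser
    version = (int(match.group(1)), int(match.group(2)))
    return version >= MIN_JSON_VERSION


def build_smartctl_args(version_output, device):
    """Build the argument list, adding --json only when it is supported."""
    args = ["smartctl", "--xall"]
    if supports_json(version_output):
        args.append("--json")
    args.append(device)
    return args
```

Falling back to text whenever the version cannot be determined keeps older or unusual builds working, at the cost of using the more fragile parser there.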

NVMe support, modeled after GSmartControl 2.0 which implements a comprehensive multi-device parser architecture. NVMe drives expose additional health data that should be evaluated, including:

  • Critical Warning flags
  • Available Spare / Available Spare Threshold
  • Percentage Used (device life estimate)
  • Unsafe Shutdowns
  • Media and Data Integrity Errors
  • Warning/Critical Composite Temperature Time
  • Temperature Sensors
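
A minimal sketch of how some of these values might be evaluated is shown below. It assumes the key names found in the `nvme_smart_health_information_log` object of smartctl's JSON output (`critical_warning`, `available_spare`, `available_spare_threshold`, `percentage_used`, `media_errors`); the function name, the state mapping and the 90 % wear threshold are illustrative assumptions, not agreed-upon defaults.

```python
OK, WARN, CRIT = 0, 1, 2  # Nagios-style states


def evaluate_nvme_health(log, used_warn_pct=90):
    """Evaluate an NVMe health log dict and return (state, messages).

    Every finding is collected, so one CRIT does not hide other problems.
    The threshold `used_warn_pct` is a hypothetical default.
    """
    state = OK
    messages = []

    def report(level, text):
        nonlocal state
        state = max(state, level)
        messages.append(text)

    flags = log.get("critical_warning", 0)
    if flags != 0:
        report(CRIT, f"critical warning flags set: {flags:#04x}")

    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        report(CRIT, f"available spare {spare}% below threshold {threshold}%")

    used = log.get("percentage_used", 0)
    if used >= used_warn_pct:
        report(WARN, f"percentage used is {used}%")

    media_errors = log.get("media_errors", 0)
    if media_errors > 0:
        report(WARN, f"{media_errors} media and data integrity errors")

    return state, messages
```

Unsafe shutdowns, temperature times and sensor readings could be handled the same way, either as further rules or as perfdata only.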

Data-driven attribute database instead of hardcoded if/elif chains, inspired by GSmartControl's storage_property_descr_ata_attribute.cpp (~200 attributes with HDD/SSD distinction).
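
To illustrate the idea, a data-driven table might look like the sketch below. The three entries, their thresholds and the HDD/SSD assignment are examples chosen for this sketch, not values taken from GSmartControl, and the row layout (`id`, `raw.value`) assumes the `ata_smart_attributes.table` entries of smartctl's JSON output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeRule:
    name: str
    disk_types: frozenset  # subset of {"hdd", "ssd"}
    warn_raw: int          # WARN when the raw value reaches this number


# Illustrative entries only; a real database would hold ~200 attributes.
ATTRIBUTE_DB = {
    5: AttributeRule("Reallocated Sector Count", frozenset({"hdd", "ssd"}), 1),
    10: AttributeRule("Spin Retry Count", frozenset({"hdd"}), 1),
    198: AttributeRule("Offline Uncorrectable", frozenset({"hdd", "ssd"}), 1),
}


def check_attributes(table, disk_type):
    """Return a list of findings for the attribute rows of one disk."""
    findings = []
    for row in table:
        rule = ATTRIBUTE_DB.get(row["id"])
        if rule is None or disk_type not in rule.disk_types:
            continue  # unknown attribute, or not relevant for this disk type
        raw = row["raw"]["value"]
        if raw >= rule.warn_raw:
            findings.append(f"{rule.name} (ID {row['id']}): raw value {raw}")
    return findings
```

Adding or tuning an attribute then means editing one table entry instead of touching another branch of an if/elif chain.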

Improved error handling: Show smartctl output on failure for easier debugging.
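
Part of this could be decoding smartctl's exit status, which the smartctl man page documents as a bitmask of eight bits. The sketch below paraphrases those bit meanings; the function names and the message layout are hypothetical.

```python
# Bit meanings paraphrased from the "EXIT STATUS" section of smartctl(8).
EXIT_STATUS_BITS = (
    "command line did not parse",
    "device open failed or device is in low-power mode",
    "a SMART or other ATA command failed, or checksum error",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "attributes were at or below threshold in the past",
    "device error log contains errors",
    "device self-test log contains errors",
)


def describe_exit_status(returncode):
    """Translate a smartctl exit status into human-readable reasons."""
    return [
        text
        for bit, text in enumerate(EXIT_STATUS_BITS)
        if returncode & (1 << bit)
    ]


def format_failure(returncode, stdout, stderr):
    """Build a multi-line message that includes the raw smartctl output."""
    reasons = describe_exit_status(returncode) or ["no status bits set"]
    lines = [f"smartctl exited with status {returncode}: " + "; ".join(reasons)]
    if stdout.strip():
        lines.append("stdout: " + stdout.strip())
    if stderr.strip():
        lines.append("stderr: " + stderr.strip())
    return "\n".join(lines)
```

Only the lowest bits indicate that smartctl itself could not run properly; the higher bits describe the disk's condition and would map to the specialized checks instead.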

Closes #4
Closes #222
Closes #291
Closes #568
Closes #604
Closes #671

Labels: enhancement