4 changes: 4 additions & 0 deletions src/ros2_medkit_fault_manager/CHANGELOG.rst
@@ -2,6 +2,10 @@
Changelog for package ros2_medkit_fault_manager
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Forthcoming
-----------
* Optional append-only, hash-chained audit log of fault state transitions: each transition appends one immutable row (``record_hash = sha256(prev_hash + canonical(event))`` via OpenSSL EVP SHA-256) with a persisted chain head, a ``verify`` routine, a read API, and retention that seals a segment anchor before pruning. Time-based (PREFAILED->CONFIRMED) auto-confirmations are also audited. ``verify`` reads the chain head directly from the database, so deleting the newest row together with the head row is reported as tampering instead of silently recovering. ``BEFORE UPDATE`` / ``BEFORE DELETE`` triggers reject out-of-band edits as defense-in-depth. The chain is unkeyed and stored in a single writable file, so ``verify`` detects edits/deletions that did not recompute the chain (casual or accidental tampering); it is not a defense against an attacker who can rewrite the whole file. Off by default (`#483 <https://github.com/selfpatch/ros2_medkit/issues/483>`_)

0.6.0 (2026-06-22)
------------------
* Bounded concurrent snapshot capture under fault storms with a ``CaptureThreadPool`` and configurable capture pool / queue / overflow-policy parameters. The rosbag leg is serialized and the cooldown map is bounded, so a burst of simultaneous faults can no longer exhaust capture threads or grow memory without limit (`#456 <https://github.com/selfpatch/ros2_medkit/pull/456>`_)
8 changes: 8 additions & 0 deletions src/ros2_medkit_fault_manager/CMakeLists.txt
@@ -41,6 +41,8 @@ find_package(ros2_medkit_msgs REQUIRED)
find_package(ros2_medkit_serialization REQUIRED)
find_package(SQLite3 REQUIRED)
find_package(nlohmann_json REQUIRED)
# OpenSSL EVP SHA-256 for the tamper-evident audit log hash chain
find_package(OpenSSL REQUIRED)
# yaml-cpp is required as transitive dependency from ros2_medkit_serialization
medkit_find_yaml_cpp()
# rosbag2 for time-window snapshot recording
@@ -55,6 +57,7 @@ add_library(fault_manager_lib STATIC
src/fault_manager_node.cpp
src/fault_storage.cpp
src/sqlite_fault_storage.cpp
src/fault_audit_log.cpp
src/snapshot_capture.cpp
src/rosbag_capture.cpp
src/correlation/types.cpp
@@ -81,6 +84,7 @@ target_link_libraries(fault_manager_lib PUBLIC
SQLite::SQLite3
nlohmann_json::nlohmann_json
yaml-cpp::yaml-cpp
OpenSSL::Crypto
)

medkit_apply_compat_defs(fault_manager_lib)
@@ -143,6 +147,10 @@ if(BUILD_TESTING)
medkit_target_dependencies(test_sqlite_storage rclcpp ros2_medkit_msgs)
medkit_set_test_domain(test_sqlite_storage)

# Fault audit log tests (hash chain, verify, rotation, reopen)
ament_add_gtest(test_fault_audit_log test/test_fault_audit_log.cpp)
target_link_libraries(test_fault_audit_log fault_manager_lib)

# Snapshot capture tests
ament_add_gtest(test_snapshot_capture test/test_snapshot_capture.cpp)
target_link_libraries(test_snapshot_capture fault_manager_lib)
23 changes: 23 additions & 0 deletions src/ros2_medkit_fault_manager/README.md
@@ -53,6 +53,7 @@ ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
- **Debounce filtering** (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
- **Snapshot capture**: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
- **Fault correlation** (optional): Root cause analysis with symptom muting and auto-clear
- **Tamper-evident audit log** (optional): Append-only, hash-chained record of fault state transitions for verifiable history

## Parameters

@@ -109,6 +110,28 @@ patterns:

**Memory**: Faults are stored in memory only. Useful for testing or when persistence is not required.

## Advanced: Tamper-Evident Audit Log

An optional append-only, hash-chained audit log records every fault state transition (`occurred`, `confirmed`, `healed`, `cleared`) so the fault history is independently verifiable. Auto-recovery (a fault reaching the healing threshold via PASSED events) is recorded as a distinct `healed` row with source `auto_heal`, so the fault's END is in the timeline and is not confused with a manual `cleared`. The manager has no acknowledge action separate from clearing, so `~/clear_fault` is recorded as `cleared` (clear == ack); there is no `ack` kind. The log also records its own lifecycle with `logging_activated` / `logging_deactivated` markers at start and stop. It is **off by default** because it adds a write and storage cost per transition.

Each transition appends one immutable row holding `record_hash = sha256(prev_hash + canonical(event))` (OpenSSL EVP SHA-256), the `prev_hash` it links to, and a monotonic `seq`. The hash is computed once at insert and never recomputed. A persisted chain head lets the chain resume across restarts. The log is stored in its own SQLite database (separate from the fault store) and is treated as append-only: the manager only ever inserts rows, and `BEFORE UPDATE` / `BEFORE DELETE` triggers reject out-of-band edits (the guarded rotation prune excepted).
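
The chaining rule above can be illustrated with a minimal Python sketch. It is not the package's C++ implementation: the all-zero genesis hash, the sorted-key JSON canonical form, and the event fields are assumptions made only for illustration.

```python
import hashlib
import json

# Assumption: the first row links to an all-zero hash. The real genesis value is not stated in this README.
GENESIS = "0" * 64


def canonical(event: dict) -> str:
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def record_hash(prev_hash: str, event: dict) -> str:
    # record_hash = sha256(prev_hash + canonical(event))
    return hashlib.sha256((prev_hash + canonical(event)).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> dict:
    # The hash is computed once at insert time and stored with the row.
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    row = {
        "seq": len(chain) + 1,  # monotonic sequence number
        "prev_hash": prev_hash,
        "event": event,
        "record_hash": record_hash(prev_hash, event),
    }
    chain.append(row)
    return row


chain = []
append(chain, {"fault_code": "MOTOR_OVERHEAT", "kind": "occurred"})
append(chain, {"fault_code": "MOTOR_OVERHEAT", "kind": "confirmed"})
```

Each row's `prev_hash` equals the previous row's `record_hash`, which is what lets a later walk over the rows detect an edited or missing link.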

**Completeness is an integrity property.** `verify()` proves nothing was *deleted* from the chain, but it cannot prove a transition that was *never appended*. So a silently dropped append is a hole `verify()` can never see. Every transition on the write path is therefore audited (occurred, timer/threshold confirmations, auto-heal, and clears), and an append failure is never swallowed silently: it increments a dropped-writes counter and clears an "audit healthy" flag. **These are in-process signals only** (C++ getters on the node). This revision exposes no service/REST/health endpoint that surfaces audit health or lets an operator run `verify()` at runtime, so the signals are **not operator-observable at runtime yet** - a runtime read/verify/health surface is future work. With `audit_log.fail_closed` set, an append failure is re-raised as a **fail-FAST** error so a compliance-strict deployment learns the audit broke. This does **not** roll back the fault-state change that already committed: the audit log and the fault store are **separate SQLite databases**, so there is no cross-DB atomicity, and `fail_closed` is a broken-audit alarm requiring operator action, not a rollback. The default (`fail_closed=false`) keeps fault processing running; either way the in-process signals record the gap.
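
The append-failure semantics described above can be sketched as follows. `AuditSink`, `transition`, and `broken_append` are hypothetical names used only for illustration; the real signals are C++ getters on the node, and this sketch only models the ordering (fault-state change first, audit append second, no rollback).

```python
class AuditSink:
    """Hypothetical wrapper that tracks audit health around an append callable."""

    def __init__(self, append, fail_closed=False):
        self._append = append
        self.fail_closed = fail_closed
        self.dropped_writes = 0  # in-process counter of failed appends
        self.healthy = True      # cleared on the first failed append

    def record(self, event):
        try:
            self._append(event)
        except Exception:
            # Never swallow the failure silently: record the gap first.
            self.dropped_writes += 1
            self.healthy = False
            if self.fail_closed:
                raise  # fail-fast alarm, not a rollback


def broken_append(event):
    # Stand-in for an audit database that cannot be written.
    raise OSError("audit database unavailable")


fault_store = []


def transition(sink, event):
    fault_store.append(event)  # the fault-state change commits first
    sink.record(event)         # the audit append targets a separate store


lenient = AuditSink(broken_append, fail_closed=False)
transition(lenient, {"kind": "confirmed"})

strict = AuditSink(broken_append, fail_closed=True)
try:
    transition(strict, {"kind": "cleared"})
    strict_raised = False
except OSError:
    strict_raised = True
```

In both modes the fault-state change stays in `fault_store`; `fail_closed` only changes whether the caller sees the failure as an error.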

`verify()` walks the persisted chain oldest-first and recomputes every link: editing a row breaks its `record_hash`, and deleting a row breaks the next row's `prev_hash` linkage. Deleting the newest row *while leaving the head untouched* is caught by the persisted-head check (the head is read straight from the DB). However, deleting the newest row(s) **and** repointing the head to the prior row's values with a single `UPDATE audit_chain_head SET seq=..., record_hash=...` costs no more than any other casual edit, and the truncated chain still verifies. This does **not** require the "recompute the entire chain" effort that the threat model below might suggest. There is no external record that a later `seq` ever existed, so this tail-truncation is undetectable by design.
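
The detection cases in this paragraph, including the undetectable tail-truncation, can be reproduced with a small in-memory Python sketch. The genesis value, the canonical form, and the helper names (`link`, `build`, `verify`) are assumptions; the real `verify()` reads rows and head from SQLite.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed genesis prev_hash


def link(prev_hash, event):
    # sha256(prev_hash + canonical(event)), with an assumed sorted-key JSON canonical form.
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build(events):
    rows, prev = [], GENESIS
    for seq, event in enumerate(events, start=1):
        h = link(prev, event)
        rows.append({"seq": seq, "prev_hash": prev, "event": event, "record_hash": h})
        prev = h
    head = {"seq": rows[-1]["seq"], "record_hash": prev}
    return rows, head


def verify(rows, head):
    # Walk oldest-first, recompute every link, then compare against the persisted head.
    prev, expected_seq = GENESIS, 1
    for row in rows:
        if row["seq"] != expected_seq or row["prev_hash"] != prev:
            return False
        if link(row["prev_hash"], row["event"]) != row["record_hash"]:
            return False
        prev, expected_seq = row["record_hash"], expected_seq + 1
    return head["seq"] == expected_seq - 1 and head["record_hash"] == prev


rows, head = build([{"kind": k} for k in ("occurred", "confirmed", "cleared")])
intact_ok = verify(rows, head)

edited = [dict(r) for r in rows]
edited[1]["event"] = {"kind": "healed"}  # edit a row without recomputing its hash
edit_detected = not verify(edited, head)

middle_deletion_detected = not verify([rows[0], rows[2]], head)
tail_deletion_detected = not verify(rows[:2], head)  # head still points at seq 3

# Tail-truncation plus a repointed head: the shortened chain is self-consistent.
repointed_head = {"seq": 2, "record_hash": rows[1]["record_hash"]}
truncation_passes = verify(rows[:2], repointed_head)
```

The last case passes verification because nothing outside the file records that `seq` 3 ever existed.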

**Threat model (read this).** The chain is **unkeyed**, and the head and segment anchors live in the **same writable SQLite file** as the rows. `verify()` therefore catches edits or deletions that did **not** also recompute the chain - that is, casual or accidental tampering, and the bookkeeping bugs that would otherwise lose records. The append-only triggers are defense-in-depth: `audit_log` rejects out-of-band UPDATE/DELETE, `audit_anchors` carries the same guard-gated triggers so an out-of-band INSERT/UPDATE/DELETE of an anchor is rejected too, and the rotation-prune guard (`audit_prune_guard`) is itself protected by a trigger so an external writer cannot simply flip it open and then delete a prefix (or forge an anchor) - that flip is only permitted from the in-process connection that holds a per-connection temp marker. The single-row chain head (`audit_chain_head`) is intentionally **not** trigger-protected (a trigger there would block the legitimate head update inside the append transaction); a casual edit or delete of the head is instead caught by `verify()` via the seq/hash/head-mismatch checks. None of this stops an attacker with write access to the file: such an attacker can create the same temp marker or drop the triggers, and recompute the entire chain (head and anchors included) to forge a self-consistent history - and cheaper still, the tail-truncation above and the forged prefix-truncation below need no recompute at all. The triggers are **not** a security boundary - this is tamper-**evident**, not tamper-**proof**. True tamper-*proofing* requires a key or signature over the head (so it cannot be recomputed without the key) or external anchoring of the head hash to an append-only store you do not control; both are out of scope here and belong to the audit-log exporter / signing follow-up.
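
A minimal sketch of the append-only triggers, and of why they are not a security boundary, using Python's built-in `sqlite3` module. The simplified schema and the trigger names are assumptions; only the `audit_log` table name and the `seq` / `prev_hash` / `record_hash` columns come from the description above, and the guarded rotation prune is not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_log (
        seq         INTEGER PRIMARY KEY,
        prev_hash   TEXT NOT NULL,
        record_hash TEXT NOT NULL
    );
    -- Hypothetical trigger names; both abort any out-of-band change.
    CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;
    CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;
    """
)
conn.execute("INSERT INTO audit_log VALUES (1, 'genesis', 'abc')")


def rejected(sql):
    # True when SQLite refuses the statement, False when it goes through.
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


update_rejected = rejected("UPDATE audit_log SET record_hash = 'forged' WHERE seq = 1")
delete_rejected = rejected("DELETE FROM audit_log WHERE seq = 1")

# Anyone with write access to the file can drop the trigger first.
conn.execute("DROP TRIGGER audit_log_no_delete")
delete_rejected_after_drop = rejected("DELETE FROM audit_log WHERE seq = 1")
remaining_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

The triggers stop casual edits, but once the trigger is dropped the delete succeeds, which is the tamper-evident versus tamper-proof distinction made above.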

**Retention/rotation**: when more than `audit_log.retention_max_records` rows are retained, the oldest segment is *sealed* (its final `seq` + hash are persisted as an anchor) and then pruned. The surviving tail still verifies because the oldest retained row links back to the sealed anchor. Only the anchor at the current prune boundary is kept - the same rotation drops older anchors - so `audit_anchors` stays bounded (one row) instead of growing one row per rotation. Because `verify()` treats any matching sealed anchor as a valid tail root, a **forged** prefix-truncation (an out-of-band actor deletes a prefix and inserts a matching anchor) is **indistinguishable** from legitimate pruning: "the surviving tail still verifies" therefore covers a forged truncation exactly as well as a real one. The guard-gated `audit_anchors` triggers raise the bar for this (casual/accidental only) but, like every trigger here, a write-capable adversary can drop them - so this stays tamper-**evident**, not tamper-**proof**.
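
Seal-then-prune and the forged-truncation caveat can be sketched in Python as below. The helper names (`rotate`, `verify_tail`), the genesis value, and the canonical form are assumptions for illustration; the real rotation runs inside SQLite under the prune guard.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed genesis prev_hash


def link(prev_hash, event):
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build(events):
    rows, prev = [], GENESIS
    for seq, event in enumerate(events, start=1):
        h = link(prev, event)
        rows.append({"seq": seq, "prev_hash": prev, "event": event, "record_hash": h})
        prev = h
    return rows


def rotate(rows, anchor, max_records):
    # Seal the last pruned row as the single anchor, then drop the prefix.
    if max_records <= 0 or len(rows) <= max_records:  # 0 = unlimited
        return rows, anchor
    cut = len(rows) - max_records
    sealed = rows[cut - 1]
    return rows[cut:], {"seq": sealed["seq"], "record_hash": sealed["record_hash"]}


def verify_tail(rows, anchor):
    # The oldest retained row must link back to the anchor (or to genesis if none).
    prev = anchor["record_hash"] if anchor else GENESIS
    expected_seq = anchor["seq"] + 1 if anchor else 1
    for row in rows:
        if row["seq"] != expected_seq or row["prev_hash"] != prev:
            return False
        if link(prev, row["event"]) != row["record_hash"]:
            return False
        prev, expected_seq = row["record_hash"], expected_seq + 1
    return True


rows = build([{"n": i} for i in range(5)])
tail, anchor = rotate(rows, None, max_records=2)
legit_ok = verify_tail(tail, anchor)
unanchored_ok = verify_tail(tail, None)

# Forged prefix-truncation: delete one more row and insert a matching anchor.
forged_anchor = {"seq": 4, "record_hash": rows[3]["record_hash"]}
forged_ok = verify_tail(rows[4:], forged_anchor)
```

Both the legitimate and the forged tail verify, which is why a matching anchor alone cannot distinguish real pruning from an out-of-band truncation.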

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audit_log.enabled` | bool | `false` | Enable the tamper-evident audit log |
| `audit_log.transitions` | string | `"all"` | Which transitions to record: `"all"` (occurred/confirmed/healed/cleared) or `"confirmed_only"`. Lifecycle markers are always recorded. |
| `audit_log.database_path` | string | `""` | SQLite path. Empty => sibling `fault_audit.db` next to the fault DB (or `:memory:` for in-memory fault stores) |
| `audit_log.retention_max_records` | int | `0` | Seal + prune the oldest segment beyond this many retained records (0 = unlimited) |
| `audit_log.fail_closed` | bool | `false` | When `true`, an audit append failure is re-raised as a fail-FAST error signalling the audit chain is broken and needs operator action. It does **not** roll back the already-committed fault-state change (the fault store is a separate DB - no cross-DB atomicity). When `false`, the failure is logged and counted and fault processing continues. Either way the gap is recorded via the in-process dropped-writes / audit-healthy signals (not operator-observable at runtime yet). |

## Usage

### Launch
23 changes: 23 additions & 0 deletions src/ros2_medkit_fault_manager/config/fault_manager.yaml
@@ -27,3 +27,26 @@ fault_manager:
# snapshots.capture_pool_size: 2 # max concurrent capture threads (>= 1)
# snapshots.capture_queue_depth: 16 # max pending captures (>= 1)
# snapshots.capture_queue_full_policy: reject_newest # reject_newest | drop_oldest

# Append-only, hash-chained audit log of fault state transitions
# (occurred/confirmed/healed/cleared). OFF by default: it adds a write +
# storage cost per transition. When enabled, each transition appends one
# immutable, hash-chained row, so verify() detects edits or deletions that did
# NOT also recompute the chain (casual/accidental tampering). The chain is
# unkeyed and lives in a single writable file, so it is NOT proof against an
# attacker who can rewrite the whole file; true tamper-proofing needs a signed
# head or external anchoring (out of scope here). See README "Threat model".
audit_log.enabled: false
# Which transitions to record: "all" or "confirmed_only".
# audit_log.transitions: all
# SQLite path. Empty => sibling "fault_audit.db" next to the fault DB
# (or :memory: when the fault store is in-memory).
# audit_log.database_path: ""
# Seal + prune the oldest segment beyond this many retained records
# (0 = unlimited). A sealed anchor keeps the surviving tail verifiable.
# audit_log.retention_max_records: 0
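# Fail-closed (compliance-strict): when true, an audit append failure is
# re-raised as a fail-fast error instead of only being logged and counted. It
# does NOT roll back the already-committed fault-state change (the fault store
# is a separate DB, so there is no cross-DB atomicity). Default false keeps
# fault processing running; either way the in-process dropped-writes /
# audit-healthy signals record the gap (not operator-observable at runtime yet).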
# audit_log.fail_closed: false