When the delivery engine finishes with a message (delivered, bounced, or failed), it marks the queue entry complete in Redis first, then deletes the `.eml` file from disk. If the process crashes between those two steps, or the deletion itself fails, the delete is not retried and the file stays on disk forever.
```go
// delivery.go — success path
e.queue.Complete(ctx, msg.ID) // Redis: done
// ... other updates ...
if err := e.cleanupMessageFile(msg.MessagePath); err != nil {
    logger.WarnContext(ctx, "Failed to cleanup message file", ...)
    // execution continues; the file is now orphaned
}
```
The cleanup failure is logged as a warning but doesn't prevent the message from being marked as delivered. Over time, orphaned files accumulate in the queue directory.
How to reproduce
Hard to trigger intentionally, but it will happen naturally:
- Process killed (SIGKILL, OOM, power loss) between `Complete()` and cleanup
- File permission changes on the queue directory
- Disk full preventing the delete syscall
Suggestion
Add a periodic cleanup job that scans the queue directory for `.eml` files older than `retry_max_age` (7 days) that have no corresponding pending/processing entry in Redis. A file that old with no live queue entry will never be picked up for another delivery attempt, so it is safe to delete. The job could run alongside the existing recovery worker on the same 5-minute interval.