Skip to content

Fix S5 shutdown teardown: observe CloseAsync outcome + bound the wait (EventHub/ServiceBus)#203

Merged
JDetmar merged 4 commits into
masterfrom
fix/s5-shutdown-teardown
Jun 24, 2026
Merged

Fix S5 shutdown teardown: observe CloseAsync outcome + bound the wait (EventHub/ServiceBus)#203
JDetmar merged 4 commits into
masterfrom
fix/s5-shutdown-teardown

Conversation

@JDetmar

@JDetmar JDetmar commented Jun 23, 2026

Copy link
Copy Markdown
Owner

S5 — shutdown teardown on EventHubTarget / ServiceBusTarget

The audit (BUGHUNT.md, S5 remainder) flagged this as "shutdown log loss (flush abandoned)". Reproducing before writing tests corrected that premise. The headline "log loss" does not happen; two adjacent defects do.

Both targets had:

protected override void CloseTarget()
{
    var task = Task.Run(async () => await _xService.CloseAsync().ConfigureAwait(false));
    task.Wait(TimeSpan.FromMilliseconds(500));   // bool result discarded
    base.CloseTarget();
}

CloseAsync() closes the connection (EventHub _client.CloseAsync(); ServiceBus _sender.CloseAsync() then _client.DisposeAsync()), not the NLog queue — draining the queue is AsyncTaskTarget's job (neither target overrides FlushAsync). The connection was closed before base.CloseTarget().

Reproduction (instrumented fakes + LogFactory)

I added recording IEventHubService / ICloudServiceBus fakes that log the order of batch sends relative to CloseAsync, and can make CloseAsync slow and/or throw. Findings on origin/master:

# Question Result
1 Does NLog flush before or after CloseTarget? Any send over a closed client? Flushes BEFORE CloseTarget in every shutdown path (Shutdown(), Configuration = null, Dispose()). Timeline: `send
2 CloseAsync throws after the 500 ms wait elapses — observed anywhere? Silent unobserved TaskException (confirmed: UnobservedTaskException fired once, nothing in the internal log). Real defect.
3 CloseAsync throws within 500 ms — does Wait rethrow out of CloseTarget? Shutdown does not propagate it — NLog's Target.Close() already swallows CloseTarget exceptions. Throwing an AggregateException out of CloseTarget is still poor form and gets logged as a scary error.
4 Slow (800 ms) successful close — awaited or abandoned? Abandoned at the 500 ms cap (shutdownMs≈502, connection still closing). For ServiceBus the abandoned step is _client.DisposeAsync()leaked client. Real defect.

Fix (symmetric across both targets)

  • Drain, then close. Call base.CloseTarget() first so NLog flushes any remaining queued events over the still-open connection, then close the connection — never the reverse. (Defensive: repro showed NLog already flushes first, but this makes the ordering correct by construction.)
  • Always observe the outcome. A close that faults within the wait is caught and logged via InternalLogger; on timeout a fault-observing continuation is attached so a later fault can never become an unobserved task exception, plus a warning is logged. CloseTarget never throws.
  • Sensible bounded wait. The arbitrary 500 ms cap → a 5 s bound generous enough for a healthy-but-slow teardown. Because the outcome is always observed (even past the bound), the bound governs only how long shutdown blocks, never correctness — so the ServiceBus DisposeAsync() is no longer silently abandoned.
  • CloseTarget stays synchronous (NLog signature) with a correct sync-over-async bridge.

Tests (test-first; reverting the fix fails them)

Driven through the fakes + LogFactory:

  • Shutdown_WhenCloseAsyncThrows_IsLoggedAndDoesNotPropagate — a 900 ms-then-throwing close is logged via InternalLogger and does not propagate (was a silent unobserved exception). Fails pre-fix.
  • Shutdown_SlowCloseIsAwaitedNotAbandoned — an 800 ms close runs to completion during shutdown; for ServiceBus the client dispose completes too. Fails pre-fix (abandoned at 500 ms).
  • Shutdown_FlushesQueuedEventsBeforeClosingConnection — invariant guard: all events sent, none over a closed connection. Passes pre- and post-fix — documents the disproven premise; not a revert-fails test.

Verification

  • EventHubTarget tests: 10/10. ServiceBusTarget tests: 9/9 (DOTNET_ROLL_FORWARD=Major, net8 target on net9/10 runtimes).
  • Release build of both src projects across netstandard2.0; net8.0; net10.0: 0 warnings, 0 errors. TreatWarningsAsErrors not enabled.

🤖 Generated with Claude Code

JDetmar and others added 2 commits June 23, 2026 00:56
Reproduces the confirmed S5 defects via instrumented IEventHubService /
ICloudServiceBus fakes that record the order of batch sends relative to
CloseAsync and can make CloseAsync slow and/or throw:

- Shutdown_WhenCloseAsyncThrows_IsLoggedAndDoesNotPropagate: a CloseAsync
  that faults after the hard 500ms wait becomes a silent unobserved task
  exception (not logged) today. FAILS pre-fix.
- Shutdown_SlowCloseIsAwaitedNotAbandoned: an 800ms close is abandoned by
  the 500ms cap; for ServiceBus this also leaks the undisposed client.
  FAILS pre-fix.
- Shutdown_FlushesQueuedEventsBeforeClosingConnection: invariant guard -
  NLog drains the queue before CloseTarget, so nothing is sent over a
  closed connection. PASSES pre-fix (the audit's "log loss" premise is
  disproven; the reorder in the fix is defensive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ose, bound the wait

EventHub and ServiceBus CloseTarget did:

    var task = Task.Run(() => _svc.CloseAsync());
    task.Wait(TimeSpan.FromMilliseconds(500));   // bool result discarded
    base.CloseTarget();

Reproduction (instrumented fakes + LogFactory) showed two real defects and
disproved a third:

1. Unobserved task exception: a CloseAsync that faults AFTER the 500ms wait
   times out becomes a silent unobserved TaskException - never logged. The
   discard also hides faults that complete within the window.
2. Abandoned teardown: the hard 500ms cap abandons a legitimately slow close
   mid-flight. For ServiceBus the abandoned step is _client.DisposeAsync(),
   leaking the undisposed client.
3. "Shutdown log loss" (the audit premise): NOT reproduced. NLog drains the
   queue (WriteAsyncTask) before CloseTarget runs, so events are never sent
   over a closed connection in any standard shutdown path.

Fix (symmetric across both targets):
- Call base.CloseTarget() FIRST so NLog drains any remaining queued events
  over the still-open connection, then close the connection.
- Keep the synchronous sync-over-async bridge but make it correct: a faulted
  close that completes within the wait is observed and logged via
  InternalLogger; on timeout a fault-observing continuation is attached so a
  later fault is never an unobserved task exception, and a warning is logged.
- CloseTarget never throws.
- Replace the arbitrary 500ms cap with a 5s bound generous enough for a
  healthy-but-slow teardown; because the outcome is always observed, the
  bound governs only how long shutdown blocks, never correctness.

Turns the regression tests added in the previous commit green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 23, 2026 08:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes shutdown teardown behavior for EventHubTarget and ServiceBusTarget by ensuring connection close is performed after NLog has drained queued events, and by ensuring CloseAsync() outcomes are always observed (avoiding unobserved task exceptions) while keeping shutdown blocking bounded.

Changes:

  • Reorders shutdown to base.CloseTarget() first, then performs a bounded wait for connection close, with async fault observation on timeout.
  • Adds a 5s close timeout and InternalLogger warnings for close failures/timeouts (without throwing from CloseTarget).
  • Adds new regression tests and instrumented fakes to validate flush-before-close ordering, slow close awaiting, and close-fault logging behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/NLog.Extensions.AzureEventHub/EventHubTarget.cs Reorders shutdown and adds bounded/observed connection close logic for EventHub.
src/NLog.Extensions.AzureServiceBus/ServiceBusTarget.cs Reorders shutdown and adds bounded/observed connection close logic for ServiceBus (including dispose completion).
test/NLog.Extensions.AzureEventHub.Tests/EventHubShutdownTests.cs New regression tests covering shutdown flush ordering and close fault/timeout behavior (EventHub).
test/NLog.Extensions.AzureEventHub.Tests/RecordingEventHubService.cs New instrumented IEventHubService fake to record send/close ordering and simulate slow/throwing close.
test/NLog.Extensions.AzureServiceBus.Tests/ServiceBusShutdownTests.cs New regression tests covering shutdown flush ordering and close fault/timeout behavior (ServiceBus).
test/NLog.Extensions.AzureServiceBus.Tests/RecordingCloudServiceBus.cs New instrumented ICloudServiceBus fake to record send/close ordering and model sender-close + client-dispose.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/NLog.Extensions.AzureEventHub.Tests/EventHubShutdownTests.cs
JDetmar and others added 2 commits June 23, 2026 21:56
Quality cleanup of the S5 fix (no behaviour change; all tests still pass):

- Hoist the duplicated ~32-line sync-over-async close logic out of both
  EventHubTarget and ServiceBusTarget into one shared, linked helper
  (AsyncTargetCloseHelper, alongside ProxyHelpers/SortHelpers). The subtle
  observe-on-timeout logic now lives once and can't drift between targets;
  each target's CloseTarget is a 3-line try/finally.
- Collapse the helper's nested try/catch + `bool completed` into a single
  try/catch: a close that faults within the wait and one that faults after
  the timeout are handled the same way, and the warning message + the 5s
  timeout are defined once instead of repeated per call site.
- Trim the recording test fakes: drop the Timeline/DumpTimeline/CloseStarted
  bookkeeping (and the unused partition-key plumbing) left over from the
  deleted discovery harness; the tests only assert the send/close counters.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new EventHub/ServiceBus shutdown test classes mutate the global
NLog.InternalLogger.LogWriter/LogLevel but were not enrolled in the
existing [Collection("InternalLogger isolation")] that disables
parallelization for InternalLogger-mutating tests. Add the attribute so
they cannot run concurrently with sibling tests that read/write the same
global state, removing a potential source of suite flakiness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@JDetmar JDetmar merged commit 84fc28d into master Jun 24, 2026
22 checks passed
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants