Summary
We're seeing intermittent test hangs/timeouts when doctor is used via vacuum rule execution. Goroutines end up blocked in WaitForCompletion() waiting on sync.Cond.
From inspection of v0.0.49, both worker pools can miss a wakeup:
model/high/v3/walk_pool.go (WalkPool.worker)
model/high/v3/schema_pool.go (SchemaWalkPool.processItem)
Both decrement inFlight and call cond.Broadcast() without holding mu.
Reason (as far as I can tell)
WaitForCompletion() does:
- lock
mu
- check
inFlight > 0
- call
cond.Wait()
Workers do:
inFlight.Add(-1)
- if zero, call
cond.Broadcast() (currently unlocked)
Because the condition transitions and broadcast are not synchronized with the same lock used by waiters, there is a missed-wakeup window between the waiter's condition check and entering Wait(). If inFlight reaches zero and broadcast happens in that window, waiter can sleep indefinitely.
Trace
The stuck goroutines consistently include:
github.com/pb33f/doctor/model/high/v3.(*WalkPool).WaitForCompletion
github.com/pb33f/doctor/model/high/v3.(*DrContext).WaitForCompletion
github.com/pb33f/doctor/model/high/v3.(*Document).Walk
- caller waiting in
github.com/daveshanley/vacuum/motor.ApplyRulesToRuleSet
Proposed fix
Guard Broadcast() with mu in both pool implementations:
if p.inFlight.Add(-1) == 0 {
p.mu.Lock()
p.cond.Broadcast()
p.mu.Unlock()
}
Applied to:
model/high/v3/walk_pool.go
model/high/v3/schema_pool.go
Happy to open a PR with this change if you'd like.
Summary
We're seeing intermittent test hangs/timeouts when
doctoris used viavacuumrule execution. Goroutines end up blocked inWaitForCompletion()waiting onsync.Cond.From inspection of
v0.0.49, both worker pools can miss a wakeup:model/high/v3/walk_pool.go(WalkPool.worker)model/high/v3/schema_pool.go(SchemaWalkPool.processItem)Both decrement
inFlightand callcond.Broadcast()without holdingmu.Reason (as far as I can tell)
WaitForCompletion()does:muinFlight > 0cond.Wait()Workers do:
inFlight.Add(-1)cond.Broadcast()(currently unlocked)Because the condition transitions and broadcast are not synchronized with the same lock used by waiters, there is a missed-wakeup window between the waiter's condition check and entering
Wait(). IfinFlightreaches zero and broadcast happens in that window, waiter can sleep indefinitely.Trace
The stuck goroutines consistently include:
github.com/pb33f/doctor/model/high/v3.(*WalkPool).WaitForCompletiongithub.com/pb33f/doctor/model/high/v3.(*DrContext).WaitForCompletiongithub.com/pb33f/doctor/model/high/v3.(*Document).Walkgithub.com/daveshanley/vacuum/motor.ApplyRulesToRuleSetProposed fix
Guard
Broadcast()withmuin both pool implementations:Applied to:
model/high/v3/walk_pool.gomodel/high/v3/schema_pool.goHappy to open a PR with this change if you'd like.