Potential deadlock in WalkPool/SchemaWalkPool WaitForCompletion (missed wakeup) #85

@califlower

Description

Summary

We're seeing intermittent test hangs/timeouts when doctor is used via vacuum rule execution. Goroutines end up blocked in WaitForCompletion(), waiting on a sync.Cond.

From inspection of v0.0.49, both worker pools can miss a wakeup:

  • model/high/v3/walk_pool.go (WalkPool.worker)
  • model/high/v3/schema_pool.go (SchemaWalkPool.processItem)

Both decrement inFlight and call cond.Broadcast() without holding mu.

Reason (as far as I can tell)

WaitForCompletion() does:

  1. lock mu
  2. check inFlight > 0
  3. call cond.Wait()
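In code form, I'd expect the waiter to look roughly like this (a sketch of the v0.0.49 shape reconstructed from the steps above, not a verbatim copy; field names are assumed):

func (p *WalkPool) WaitForCompletion() {
    p.mu.Lock()
    for p.inFlight.Load() > 0 { // condition checked while holding mu
        p.cond.Wait()           // registers on the cond, releases mu, parks
    }
    p.mu.Unlock()
}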

Workers do:

  1. inFlight.Add(-1)
  2. if the count reaches zero, call cond.Broadcast() (currently without holding mu)
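The worker side, again sketched from the description above rather than copied from the source:

// after a worker finishes an item (current, racy shape)
if p.inFlight.Add(-1) == 0 {
    p.cond.Broadcast() // mu is NOT held here
}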

Because the condition transition and the Broadcast() are not synchronized with the same lock the waiter uses, there is a missed-wakeup window between the waiter's condition check and its entry into Wait(). If inFlight drops to zero and the Broadcast() fires inside that window, it wakes nobody (the waiter isn't registered on the cond yet), and the waiter can then sleep indefinitely.
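A minimal, self-contained program that forces this interleaving (my own sketch, not doctor code; the sleeps only widen the race window so the hang is reproducible):

package main

import (
    "fmt"
    "os"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    var (
        mu       sync.Mutex
        inFlight atomic.Int64
    )
    cond := sync.NewCond(&mu)
    inFlight.Store(1) // one item "in flight"

    // Worker: decrements and broadcasts WITHOUT holding mu.
    go func() {
        time.Sleep(50 * time.Millisecond) // run after the waiter's check
        if inFlight.Add(-1) == 0 {
            cond.Broadcast() // nobody is parked on the cond yet: wakeup lost
        }
    }()

    // Watchdog so the demo exits instead of hanging forever.
    go func() {
        time.Sleep(1 * time.Second)
        fmt.Println("still blocked in WaitForCompletion: the wakeup was missed")
        os.Exit(1)
    }()

    // Waiter (WaitForCompletion): the extra sleep widens the check-to-Wait
    // window, which in the real code is only a few instructions wide.
    mu.Lock()
    for inFlight.Load() > 0 {
        time.Sleep(100 * time.Millisecond) // the broadcast lands in this gap
        cond.Wait()                        // parks; no further broadcast comes
    }
    mu.Unlock()
    fmt.Println("completed") // not reached in this interleaving
}

With the fix proposed below (Broadcast() under mu), the worker's mu.Lock() blocks until the waiter is parked inside cond.Wait(), the broadcast reaches it, and the sketch prints "completed".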

Trace

The stuck goroutines consistently include:

  • github.com/pb33f/doctor/model/high/v3.(*WalkPool).WaitForCompletion
  • github.com/pb33f/doctor/model/high/v3.(*DrContext).WaitForCompletion
  • github.com/pb33f/doctor/model/high/v3.(*Document).Walk
  • caller waiting in github.com/daveshanley/vacuum/motor.ApplyRulesToRuleSet

Proposed fix

Guard the Broadcast() with mu in both pool implementations. Because cond.Wait() releases mu atomically with parking the waiter, a Broadcast() performed while holding mu can no longer land in the window between the waiter's check and its entry into Wait():

if p.inFlight.Add(-1) == 0 {
    p.mu.Lock()
    p.cond.Broadcast()
    p.mu.Unlock()
}

Applied to:

  • model/high/v3/walk_pool.go
  • model/high/v3/schema_pool.go

Happy to open a PR with this change if you'd like.
