
LockResolver: unbounded goroutines and error accumulation cause OOM under sustained lock contention #1937

@AmoebaProtozoa

Description


Summary

Under sustained lock contention (many locks across many regions, TiKV returning "server is busy"), LockResolver spawns unbounded concurrent goroutines that retry indefinitely, each accumulating error objects in Backoffer.errors. This creates a positive feedback loop — more retries cause more TiKV pressure, more "server is busy", more retries — eventually leading to TiDB OOM.

The core issue is that there is no concurrency limit on lock resolution goroutines and no circuit breaker to stop amplifying TiKV overload.

Evidence

Profiling data from a TiDB pod during memory growth (OOM at ~15 GB RSS).

Heap: 8.9 GB live, 95% in lock resolution backoff

      flat  flat%        cum   cum%
 5822.21MB 65.32%  5898.77MB 66.18%  fmt.Sprintf
 1365.81MB 15.32%  1365.81MB 15.32%  github.com/pkg/errors.callers
  346.16MB  3.88%     7513MB 84.29%  retry.(*Backoffer).BackoffWithCfgAndMaxSleep

5.8 GB in error strings plus 1.4 GB in stack traces, both allocated at backoff.go L184. 95% of the cumulative allocation flows through LockResolver.resolveLock; 42% through onServerIsBusy -> backoffOnNoCandidate (all replicas busy, nowhere to send the request).

CPU: 40% lock resolution, 20% protobuf serialization, 14% GC

20% of CPU is wasted in RPCContext.String() serializing *metapb.Region into error strings that nobody reads; another 14% goes to GC under memory pressure.

Trace: 2,725 hours of goroutine-time blocked

85% of goroutine-time is spent sleeping in backoff timers, a 6:1 ratio of backoff sleep to actual RPC work. The .func2.1 suffixes (anonymous closures in resolveLocks) confirm massive goroutine concurrency.

RSS vs heap (15 GB vs 8.9 GB)

This gap is expected: with GOGC=100 the heap may grow to roughly 2x the live set (a ~17.8 GB target for an 8.9 GB live set), and with 14% of CPU already spent on GC the collector can't keep up, so dead objects accumulate before they are reclaimed.

Root Cause

Four compounding issues:

1. No concurrency limit on lock resolution goroutines. resolveLocks spawns unbounded go func() at L619, L1074, L1129, L1168. Each retries up to 40s (asyncResolveLockMaxBackoff).

2. Unbounded Backoffer.errors slice. Every retry appends errors.Errorf(...) at backoff.go L184 — never truncated, only last 3 are ever read (L156). pkg/errors.Errorf captures a full stack trace each time (1.4 GB). Fork()/Clone() copy the entire slice to children (L267, L286).

3. Protobuf serialization in error messages. errors.Errorf("server is busy, ctx: %v", ctx) at replica_selector.go L571 and ~20 other sites in region_request.go and lock_resolver.go call proto.CompactTextString() on region metadata, producing large strings stored in the unbounded error slice.

4. Feedback loop. Unbounded goroutines -> TiKV overloaded -> "server is busy" -> more retries -> more memory -> GC pressure (14% CPU) -> slower -> more locks pile up -> OOM.
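The compounding of issues 1 and 2 can be modeled in isolation. A minimal, self-contained sketch (all names here are illustrative, not the actual client-go types) of how per-retry appends plus Fork() copies multiply memory when children are retained, as they are during concurrent resolution:

```go
package main

import "fmt"

// bugged mirrors the reported pattern: every retry appends an error that
// is never truncated, and Fork copies the parent's whole history.
type bugged struct{ errors []string }

func (b *bugged) retry(msg string) { b.errors = append(b.errors, msg) }

// Fork duplicates the entire error slice into the child, so each fork
// holds its own copy of the full history.
func (b *bugged) Fork() *bugged {
	child := &bugged{errors: make([]string, len(b.errors))}
	copy(child.errors, b.errors)
	return child
}

func main() {
	parent := &bugged{}
	held := 0
	for i := 0; i < 1000; i++ {
		parent.retry("server is busy, ctx: <large region metadata>")
		child := parent.Fork() // e.g. one fork per region per retry round
		held += len(child.errors)
	}
	held += len(parent.errors)
	// Total strings held grows quadratically in the number of retries.
	fmt.Println("error strings held:", held) // prints "error strings held: 501500"
}
```

With each error string carrying serialized region metadata and a pkg/errors stack trace, this quadratic duplication is what the heap profile shows.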

Suggested Fixes

Fix 1 (critical): Add a global semaphore to LockResolver bounding concurrent resolution goroutines (e.g., 128). When full, async resolutions are shed (select/default), sync ones block. This breaks the avalanche by capping TiKV request rate and bounding memory.
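A minimal sketch of such a semaphore, under the stated 128-slot limit (resolverSem, tryResolveAsync, and resolveSync are hypothetical names, not existing client-go identifiers):

```go
package main

import (
	"fmt"
	"sync"
)

// resolverSem is a hypothetical global semaphore bounding concurrent
// lock-resolution goroutines, sized per the suggested limit of 128.
var resolverSem = make(chan struct{}, 128)

// tryResolveAsync runs resolve in a new goroutine if a slot is free,
// otherwise sheds the request: async resolution is best-effort, and the
// lock will be encountered and resolved again later.
func tryResolveAsync(resolve func()) bool {
	select {
	case resolverSem <- struct{}{}:
	default:
		return false // all slots busy: shed instead of amplifying load
	}
	go func() {
		defer func() { <-resolverSem }()
		resolve()
	}()
	return true
}

// resolveSync blocks for a slot: synchronous callers must make
// progress, so they wait rather than shed.
func resolveSync(resolve func()) {
	resolverSem <- struct{}{}
	defer func() { <-resolverSem }()
	resolve()
}

// floodAsync simulates n resolution requests whose RPCs all stall (as
// under "server is busy") and reports how many were admitted.
func floodAsync(n int) int {
	stall := make(chan struct{})
	var wg sync.WaitGroup
	admitted := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		if tryResolveAsync(func() { <-stall; wg.Done() }) {
			admitted++
		} else {
			wg.Done()
		}
	}
	close(stall) // let the admitted goroutines finish
	wg.Wait()
	return admitted
}

func main() {
	fmt.Println("admitted:", floodAsync(1000)) // prints "admitted: 128"
}
```

Because the slot is acquired before the goroutine is spawned and released only when resolution finishes, in-flight goroutines, and hence the memory and TiKV request rate they generate, are strictly bounded.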

Fix 2: Cap Backoffer.errors to ~16 entries and switch to fmt.Errorf (no stack traces). Stop copying the slice in Fork()/Clone().
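A sketch of the capped history, with illustrative names and a simplified Backoffer (the real type carries much more state); the key points are the fixed-size window, fmt.Errorf instead of pkg/errors, and a Fork that does not inherit the parent's slice:

```go
package main

import "fmt"

// maxRecordedErrors caps the per-Backoffer error history; since only
// the last few entries are ever read, a small window is enough.
const maxRecordedErrors = 16

type Backoffer struct {
	errors []error
}

// recordError keeps at most maxRecordedErrors entries, dropping the
// oldest, and uses fmt.Errorf so no stack trace is captured.
func (b *Backoffer) recordError(format string, args ...interface{}) {
	if len(b.errors) == maxRecordedErrors {
		copy(b.errors, b.errors[1:])
		b.errors = b.errors[:maxRecordedErrors-1]
	}
	b.errors = append(b.errors, fmt.Errorf(format, args...))
}

// Fork gives the child an empty error history instead of copying the
// parent's slice, so forked backoffers no longer multiply memory.
func (b *Backoffer) Fork() *Backoffer {
	return &Backoffer{}
}

func main() {
	b := &Backoffer{}
	for i := 0; i < 1000; i++ {
		b.recordError("server is busy, region %d", i)
	}
	// History is bounded and keeps the newest entries.
	fmt.Println(len(b.errors), b.errors[len(b.errors)-1])
	// prints "16 server is busy, region 999"
}
```

This bounds per-goroutine memory to a constant regardless of retry count.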

Fix 3: Replace %v on RPCContext/regionErr with scalar fields (region ID, addr) in all backoff error messages. Eliminates 20% CPU from proto serialization.
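The change is small at each call site. A sketch with a simplified stand-in for RPCContext (the real type holds the full region metadata whose String() goes through proto.CompactTextString):

```go
package main

import "fmt"

// RPCContext stands in for the client-go type; fields simplified.
type RPCContext struct {
	RegionID uint64
	Addr     string
	// Meta *metapb.Region — the expensive part, deliberately not printed
}

// Before: errors.Errorf("server is busy, ctx: %v", ctx) serialized the
// whole region. After: only the scalars a human actually needs.
func busyError(ctx *RPCContext) error {
	return fmt.Errorf("server is busy, region %d, addr %s", ctx.RegionID, ctx.Addr)
}

func main() {
	err := busyError(&RPCContext{RegionID: 42, Addr: "tikv-0:20160"})
	fmt.Println(err) // prints "server is busy, region 42, addr tikv-0:20160"
}
```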

Fix 4 (longer-term): Circuit breaker — track "server is busy" ratio in LockResolver, pause spawning when overloaded.
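One possible shape for the breaker, as a rough sketch only: a counter pair per window, with a threshold and minimum sample size that are illustrative, not tuned values.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// busyBreaker pauses spawning new resolution goroutines when the recent
// "server is busy" ratio exceeds a threshold. In a real implementation
// the counters would be reset per sliding window.
type busyBreaker struct {
	total int64
	busy  int64
}

// record tallies one TiKV response.
func (b *busyBreaker) record(serverIsBusy bool) {
	atomic.AddInt64(&b.total, 1)
	if serverIsBusy {
		atomic.AddInt64(&b.busy, 1)
	}
}

// allowSpawn returns false once more than half of recent responses were
// "server is busy".
func (b *busyBreaker) allowSpawn() bool {
	total := atomic.LoadInt64(&b.total)
	if total < 10 {
		return true // not enough samples to judge
	}
	return atomic.LoadInt64(&b.busy)*2 <= total
}

func main() {
	var b busyBreaker
	for i := 0; i < 20; i++ {
		b.record(i%4 != 0) // simulate 75% busy responses
	}
	fmt.Println("spawn allowed:", b.allowSpawn()) // prints "spawn allowed: false"
}
```

Combined with Fix 1, this turns the positive feedback loop into a self-limiting one: overload stops spawning, TiKV recovers, the ratio drops, and resolution resumes.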

Fix / Effect

1. Global semaphore: breaks the avalanche; caps memory and TiKV request rate.
2. Cap errors + fmt.Errorf: bounds per-goroutine memory; eliminates the 1.4 GB of stack traces.
3. Stop proto serialization: eliminates the 20% CPU waste.
4. Circuit breaker: self-healing under sustained pressure.

Appendix: CPU & Heap profile (pprof format)

pbjsv.cpuprofile

pbjsv_heap.cpuprofile

Metadata

Labels: contribution (PR from a community contributor), first-time-contributor