Summary
Under sustained lock contention (many locks across many regions, TiKV returning "server is busy"), LockResolver spawns an unbounded number of concurrent goroutines, each retrying for up to 40s and accumulating error objects in Backoffer.errors. This creates a positive feedback loop: more retries cause more TiKV pressure, more "server is busy" responses, and more retries, eventually leading to TiDB OOM.
The core issue is that there is no concurrency limit on lock resolution goroutines and no circuit breaker to stop amplifying TiKV overload.
Evidence
Profiling data from a TiDB pod during memory growth (OOM at ~15 GB RSS).
Heap: 8.9 GB live, 95% in lock resolution backoff
flat flat% cum cum%
5822.21MB 65.32% 5898.77MB 66.18% fmt.Sprintf
1365.81MB 15.32% 1365.81MB 15.32% github.com/pkg/errors.callers
346.16MB 3.88% 7513MB 84.29% retry.(*Backoffer).BackoffWithCfgAndMaxSleep
5.8 GB in error strings + 1.4 GB in stack traces, both from backoff.go L184. 95% cumulative through LockResolver.resolveLock. 42% through onServerIsBusy -> backoffOnNoCandidate (all replicas busy, nowhere to send).
CPU: 40% lock resolution, 20% protobuf serialization, 14% GC
20% of CPU wasted in RPCContext.String() serializing *metapb.Region into error strings nobody reads. 14% on GC under memory pressure.
Trace: 2,725 hours of goroutine-time blocked
85% of goroutine-time is spent sleeping in backoff timers, a 6:1 ratio of backoff sleep to actual RPC work. The massive goroutine concurrency is confirmed by the .func2.1 suffix (anonymous closures in resolveLocks).
RSS vs heap (15 GB vs 8.9 GB)
This gap is expected: GOGC=100 allows the heap to grow to 2x the live set, and with GC already consuming 14% of CPU, collection cannot keep up, so dead objects accumulate before being reclaimed.
Root Cause
Four compounding issues:
1. No concurrency limit on lock resolution goroutines. resolveLocks spawns unbounded go func() at L619, L1074, L1129, L1168. Each retries up to 40s (asyncResolveLockMaxBackoff).
2. Unbounded Backoffer.errors slice. Every retry appends errors.Errorf(...) at backoff.go L184 — never truncated, only last 3 are ever read (L156). pkg/errors.Errorf captures a full stack trace each time (1.4 GB). Fork()/Clone() copy the entire slice to children (L267, L286).
3. Protobuf serialization in error messages. errors.Errorf("server is busy, ctx: %v", ctx) at replica_selector.go L571 and ~20 other sites in region_request.go and lock_resolver.go call proto.CompactTextString() on region metadata, producing large strings stored in the unbounded error slice.
4. Feedback loop. Unbounded goroutines -> TiKV overloaded -> "server is busy" -> more retries -> more memory -> GC pressure (14% CPU) -> slower -> more locks pile up -> OOM.
Suggested Fixes
Fix 1 (critical): Add a global semaphore to LockResolver bounding concurrent resolution goroutines (e.g., 128). When full, async resolutions are shed (select/default), sync ones block. This breaks the avalanche by capping TiKV request rate and bounding memory.
Fix 2: Cap Backoffer.errors to ~16 entries and switch to fmt.Errorf (no stack traces). Stop copying the slice in Fork()/Clone().
Fix 3: Replace %v on RPCContext/regionErr with scalar fields (region ID, addr) in all backoff error messages. Eliminates 20% CPU from proto serialization.
Fix 4 (longer-term): Circuit breaker — track "server is busy" ratio in LockResolver, pause spawning when overloaded.
| Fix | Effect |
| --- | --- |
| 1. Global semaphore | Breaks the avalanche. Caps memory and TiKV request rate. |
| 2. Cap errors + fmt.Errorf | Bounds per-goroutine memory. Eliminates 1.4 GB stack traces. |
| 3. Stop proto serialization | Eliminates 20% CPU waste. |
| 4. Circuit breaker | Self-healing under sustained pressure. |
Appendix: CPU & Heap profile (pprof format)
pbjsv.cpuprofile
pbjsv_heap.cpuprofile