Summary
Under sustained lock contention (many locks across many regions, TiKV returning "server is busy"), LockResolver spawns an unbounded number of concurrent goroutines, each retrying for up to 40s and accumulating error objects in Backoffer.errors. This creates a positive feedback loop: more retries cause more TiKV pressure, more "server is busy" responses, and more retries, eventually leading to TiDB OOM.
The core issue is that there is no concurrency limit on lock resolution goroutines and no circuit breaker to stop amplifying TiKV overload.
Evidence
Profiling data from a TiDB pod during memory growth (OOM at ~15 GB RSS).
Heap: 8.9 GB live, 95% in lock resolution backoff
flat flat% cum cum%
5822.21MB 65.32% 5898.77MB 66.18% fmt.Sprintf
1365.81MB 15.32% 1365.81MB 15.32% github.com/pkg/errors.callers
346.16MB 3.88% 7513MB 84.29% retry.(*Backoffer).BackoffWithCfgAndMaxSleep
5.8 GB in error strings + 1.4 GB in stack traces, both from backoff.go L184. 95% cumulative through LockResolver.resolveLock. 42% through onServerIsBusy -> backoffOnNoCandidate (all replicas busy, nowhere to send).
CPU: 40% lock resolution, 20% protobuf serialization, 14% GC
20% of CPU wasted in RPCContext.String() serializing *metapb.Region into error strings nobody reads. 14% on GC under memory pressure.
Trace: 2,725 hours of goroutine-time blocked
85% of goroutine-time is spent sleeping in backoff timers, a 6:1 ratio of backoff sleep to actual RPC work. The massive goroutine concurrency is confirmed by the .func2.1 suffix (anonymous closures in resolveLocks).
RSS vs heap (15 GB vs 8.9 GB)
This gap is expected: GOGC=100 allows the heap to grow to 2x the live set, and with GC already consuming 14% of CPU, collection cannot keep up, so dead objects accumulate before being reclaimed.
Root Cause
Four compounding issues:
1. No concurrency limit on lock resolution goroutines. resolveLocks spawns unbounded go func() at L619, L1074, L1129, L1168. Each retries up to 40s (asyncResolveLockMaxBackoff).
2. Unbounded Backoffer.errors slice. Every retry appends errors.Errorf(...) at backoff.go L184 — never truncated, only last 3 are ever read (L156). pkg/errors.Errorf captures a full stack trace each time (1.4 GB). Fork()/Clone() copy the entire slice to children (L267, L286).
3. Protobuf serialization in error messages. errors.Errorf("server is busy, ctx: %v", ctx) at replica_selector.go L571 and ~20 other sites in region_request.go and lock_resolver.go call proto.CompactTextString() on region metadata, producing large strings stored in the unbounded error slice.
4. Feedback loop. Unbounded goroutines -> TiKV overloaded -> "server is busy" -> more retries -> more memory -> GC pressure (14% CPU) -> slower -> more locks pile up -> OOM.
Suggested Fixes
Fix 1 (critical): Add a global semaphore to LockResolver bounding concurrent resolution goroutines (e.g., 128). When full, async resolutions are shed (select/default), sync ones block. This breaks the avalanche by capping TiKV request rate and bounding memory.
Fix 2: Cap Backoffer.errors to ~16 entries and switch to fmt.Errorf (no stack traces). Stop copying the slice in Fork()/Clone().
Fix 3: Replace %v on RPCContext/regionErr with scalar fields (region ID, addr) in all backoff error messages. Eliminates 20% CPU from proto serialization.
Fix 4 (longer-term): Circuit breaker — track "server is busy" ratio in LockResolver, pause spawning when overloaded.
| Fix | Effect |
| --- | --- |
| 1. Global semaphore | Breaks the avalanche. Caps memory and TiKV request rate. |
| 2. Cap errors + fmt.Errorf | Bounds per-goroutine memory. Eliminates 1.4 GB stack traces. |
| 3. Stop proto serialization | Eliminates 20% CPU waste. |
| 4. Circuit breaker | Self-healing under sustained pressure. |
Appendix: CPU & Heap profile (pprof format)
pbjsv.cpuprofile
pbjsv_heap.cpuprofile