ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem (base master and fix conflict) #2382

Open
rooookkie wants to merge 3 commits into apache:master from rooookkie:ZOOKEEPER-2789-zxid-overflow

@rooookkie

This PR addresses ZOOKEEPER-2789, the ZXID 32-bit counter overflow problem.

In ZooKeeper, the ZXID is a 64-bit number composed of a 32-bit epoch and a 32-bit counter. When the 32-bit counter overflows, it forces a leader re-election and can cause serious issues in the cluster. This change:

  1. Extends the counter bit width: ZxidUtils now supports configurable epoch/counter bit positions, enabling a 40-bit counter (up from 32 bits) alongside a 24-bit epoch. This raises the per-epoch limit from 2^32 (about 4.3 billion) to 2^40 (about 1.1 trillion) transactions, a 256-fold increase.
  2. Supports smooth rolling upgrade: Adds upgrade coordination logic in QuorumPeer and Learner, allowing the cluster to transition from 32-bit to 40-bit counter mode through a rolling upgrade rather than a full-cluster restart.
  3. Replaces hardcoded bit operations: Replaces all hardcoded ZXID bit operations (& 0xffffffffL, >> 32, << 32) with ZxidUtils utility methods across the codebase, making the code more maintainable and consistent.
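
To illustrate what a configurable epoch/counter split means, here is a minimal sketch. The class `ZxidLayout` and its method names are hypothetical and chosen for illustration only; they are not the ZxidUtils API from this PR, and the real implementation may differ (for example in how it handles signed shifts).

```java
// Hypothetical sketch: ZxidLayout and its method names are illustrative,
// not the ZxidUtils API introduced by this PR.
final class ZxidLayout {
    private final int counterBits;  // 32 = legacy layout, 40 = extended layout
    private final long counterMask; // the low counterBits bits set to 1

    ZxidLayout(int counterBits) {
        if (counterBits < 1 || counterBits > 62) {
            throw new IllegalArgumentException("counterBits out of range: " + counterBits);
        }
        this.counterBits = counterBits;
        this.counterMask = (1L << counterBits) - 1;
    }

    // Pack the epoch into the high bits and the counter into the low bits.
    long makeZxid(long epoch, long counter) {
        return (epoch << counterBits) | (counter & counterMask);
    }

    // Plays the role of the hardcoded `zxid >> 32` (unsigned shift here).
    long epochOf(long zxid) {
        return zxid >>> counterBits;
    }

    // Plays the role of the hardcoded `zxid & 0xffffffffL`.
    long counterOf(long zxid) {
        return zxid & counterMask;
    }

    // Largest counter value representable in this layout.
    long maxCounter() {
        return counterMask;
    }
}
```

With `new ZxidLayout(32)` the methods reproduce the legacy 32/32 split; with `new ZxidLayout(40)` the same call sites work against a 24/40 split, which is the point of routing every bit operation through one utility.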

Brief changelog

  • Extended ZxidUtils with configurable epoch high position (32-bit / 40-bit), clearEpoch(), clearCounter(), and getCounterLowPosition() methods
  • Added smooth upgrade support in QuorumPeer and Learner for transitioning from 32-bit to 40-bit counter
  • Replaced all hardcoded ZXID bit operations with ZxidUtils methods across Leader, LearnerHandler, FollowerZooKeeperServer, and ObserverMaster
  • Updated Leader.propose() to use ZxidUtils.getCounterLowPosition() for rollover detection
  • Added unit tests in LearnerHandlerTest for ZXID reassignment scenarios
  • Updated Zab1_0Test, ZxidRolloverTest, FollowerResyncConcurrencyTest, and ReconfigTest to use ZxidUtils methods
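
For readers unfamiliar with rollover detection, the idea can be sketched as below. This is an assumption-laden illustration, not the Leader.propose() code from this PR: `RolloverCheck` and `counterExhausted` are hypothetical names, and how ZxidUtils.getCounterLowPosition() feeds into the real check is not shown in this description.

```java
// Hypothetical sketch; not the Leader.propose() code from this PR.
final class RolloverCheck {
    // Returns true when the counter field of zxid is at its maximum value,
    // meaning the next proposal in this epoch would spill into the epoch bits.
    // counterBits is 32 for the legacy layout and 40 for the extended one.
    static boolean counterExhausted(long zxid, int counterBits) {
        long mask = (1L << counterBits) - 1;
        return (zxid & mask) == mask;
    }
}
```

Under the legacy layout a zxid such as `0x00000001ffffffffL` is exhausted, while under a 40-bit counter the same value still has headroom, so the check must use the width that is actually in effect.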

How does this change relate to the original PR?

This is a re-submission of #2164. The original authors (@asdf2014 and @ganzichen) have been inactive for a long time and the PR has gone stale, so I'm picking it up and rebasing it on the latest master branch.

Conflict resolution:

  • The buildRequestToProcess method in FollowerZooKeeperServer was removed upstream by ZOOKEEPER-4925, so the conflict was resolved by dropping that method and applying the bit-operation replacement directly to the current logRequest() method.
  • Additionally replaced the bit operations in ObserverMaster.processAck(), which were not covered by the original PR.

Testing done

  • Unit tests added in LearnerHandlerTest for ZXID reassignment
  • Existing ZxidRolloverTest, Zab1_0Test, FollowerResyncConcurrencyTest, and ReconfigTest updated
  • All existing tests pass

Original Authors

Credit to the original authors of #2164: @asdf2014 and @ganzichen.

anmolnar requested a review from kezhuw on May 7, 2026 18:56

maoling (Member) commented Jun 25, 2026

  1. Have you validated, under production-like conditions, an upgrade from an old 32/32 cluster to a new 24/40 cluster?
  2. The existing transaction logs and snapshots are still persisted in the 32/32 zxid format. Do these files need to be rewritten into the 24/40 format before or during the upgrade? Otherwise the system may end up with mixed zxid formats; how is consistency guaranteed in that case?
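
To make the mixed-format concern concrete, the sketch below decodes one and the same 64-bit value under both layouts. `ZxidDecode` is a hypothetical helper written for this illustration, not code from the PR, and it says nothing about how the PR actually treats persisted zxids.

```java
// Hypothetical helper for illustration; not part of the PR.
final class ZxidDecode {
    // Epoch = the bits above the counter field.
    static long epoch(long zxid, int counterBits) {
        return zxid >>> counterBits;
    }

    // Counter = the low counterBits bits.
    static long counter(long zxid, int counterBits) {
        return zxid & ((1L << counterBits) - 1);
    }
}
```

A zxid written as epoch 5, counter 1 under 32/32 is `0x0000000500000001L`. Read back with a 40-bit counter, `ZxidDecode.epoch(zxid, 40)` yields 0 and `ZxidDecode.counter(zxid, 40)` yields `0x500000001L`. The raw long compares the same either way, but any code that extracts the epoch from old files would see a different value unless it knows which layout was used when the file was written.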

An interesting brainstorming idea (feel free to ignore this):

We could introduce a zxid_v2 to replace the current 64-bit zxid. Instead of packing epoch and counter into a single long, zxid_v2 could explicitly store them as separate fields while remaining globally comparable and monotonically increasing.

class ZxidV2 {
    long epoch;
    long counter;
}

Conceptually, zxid_v2 would behave like a 128-bit identifier with lexicographical ordering:

  • Compare epoch first.
  • If epoch is equal, compare counter.
  • Guarantees global ordering and monotonic increase.
  • Removes the bit-allocation constraints imposed by the current zxid encoding.
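
The ordering described above could be sketched as follows. Only the two field names come from the class in this comment; the constructor, `compareTo`, `next`, and `nextEpoch` are assumptions added for illustration, and nothing here addresses wire or on-disk compatibility.

```java
import java.util.Objects;

// Sketch of the zxid_v2 idea. Only the field names come from the comment
// above; every method is an assumption made for illustration.
final class ZxidV2 implements Comparable<ZxidV2> {
    final long epoch;
    final long counter;

    ZxidV2(long epoch, long counter) {
        this.epoch = epoch;
        this.counter = counter;
    }

    // Lexicographic order: epoch first, then counter (signed comparison).
    @Override
    public int compareTo(ZxidV2 other) {
        int byEpoch = Long.compare(epoch, other.epoch);
        return byEpoch != 0 ? byEpoch : Long.compare(counter, other.counter);
    }

    // Next transaction in the same epoch; throws ArithmeticException on overflow.
    ZxidV2 next() {
        return new ZxidV2(epoch, Math.addExact(counter, 1L));
    }

    // First id of the following epoch; the counter restarts at 0.
    ZxidV2 nextEpoch() {
        return new ZxidV2(Math.addExact(epoch, 1L), 0L);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ZxidV2
                && ((ZxidV2) o).epoch == epoch
                && ((ZxidV2) o).counter == counter;
    }

    @Override
    public int hashCode() {
        return Objects.hash(epoch, counter);
    }
}
```

Because both fields are full longs, neither competes with the other for bits; the trade-off is that every place that currently stores or transmits a single long would need a new representation.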


3 participants