Skip to content

feat(api) start blob sub from height instead of tip of the chain#4864

Draft
ninabarbakadze wants to merge 8 commits intocelestiaorg:mainfrom
ninabarbakadze:nina/subscribe-with-height
Draft

feat(api) start blob sub from height instead of tip of the chain#4864
ninabarbakadze wants to merge 8 commits intocelestiaorg:mainfrom
ninabarbakadze:nina/subscribe-with-height

Conversation

@ninabarbakadze
Copy link
Copy Markdown
Member

@ninabarbakadze ninabarbakadze commented Mar 18, 2026

@github-actions github-actions bot added the external Issues created by non node team members label Mar 18, 2026
@ninabarbakadze ninabarbakadze added the area:api Related to celestia-node API label Mar 18, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 60.17699% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.42%. Comparing base (2469e7a) to head (c8f9bcf).
⚠️ Report is 710 commits behind head on main.

Files with missing lines Patch % Lines
blob/service.go 66.66% 26 Missing and 8 partials ⚠️
nodebuilder/blob/mocks/api.go 0.00% 9 Missing ⚠️
nodebuilder/blob/blob.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4864      +/-   ##
==========================================
- Coverage   44.83%   35.42%   -9.42%     
==========================================
  Files         265      313      +48     
  Lines       14620    21791    +7171     
==========================================
+ Hits         6555     7719    +1164     
- Misses       7313    13058    +5745     
- Partials      752     1014     +262     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

blob/service.go Outdated
// During the live phase, retrieval errors are retried until successful.
// The channel will be closed when the context is canceled or the service is stopped.
// Not reading from the returned channel will cause the stream to close after 16 messages.
func (s *Service) SubscribeWithStartHeight(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubsribeFrom

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still in draft sir pls

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't mean i cant comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see, another comment

@ninabarbakadze ninabarbakadze self-assigned this Mar 19, 2026
blob/service.go Outdated
"startHeight", startHeight,
)

blobCh := make(chan *SubscriptionResponse, 16)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to increase this buffer size as the catchup blobs will arrive much faster than live subscription. I'll defer to the reviewer on this.

@ninabarbakadze ninabarbakadze marked this pull request as ready for review March 19, 2026 12:10
@ninabarbakadze ninabarbakadze requested a review from a team as a code owner March 19, 2026 12:10
@ninabarbakadze ninabarbakadze requested a review from rootulp March 19, 2026 12:10
devin-ai-integration[bot]

This comment was marked as resolved.

@rootulp rootulp removed their request for review March 19, 2026 14:32
Copy link
Copy Markdown
Member

@walldiss walldiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work overall — the catchup→live design is solid and the code is well-structured. A few issues to address, one of which is a bug in the gap-fill path.

if err != nil {
log.Errorw("blobsub: failed to fill gap between catchup and live",
"namespace", ns.ID(), "height", expectedNextHeight, "err", err)
return
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: gap-fill treats all headerGetter errors as fatal, but "syncing in progress" is transient.

Scenario:

  1. Catchup processes heights 1..100, height 101 returns "syncing in progress" (store head is 100, syncer head is 200)
  2. isBeyondTip returns true, catchup ends with expectedNextHeight = 101
  3. headerSub starts, first live header arrives at height 200
  4. Gap-fill calls headerGetter(ctx, 101) — store still hasn't synced to 101
  5. Returns "syncing in progress" again → subscription permanently closed

This needs a retry with backoff when isBeyondTip(err) is true, similar to:

gapHeader, err := s.headerGetter(ctx, expectedNextHeight)
if err != nil {
    if isBeyondTip(err) {
        // store hasn't caught up yet, wait and retry
        select {
        case <-ctx.Done(): return
        case <-time.After(time.Second): continue
        }
    }
    // ...
}

Alternatively, consider using a WaitForHeight-style call that blocks until the header is available.

return
case <-s.ctx.Done():
return
case blobCh <- &SubscriptionResponse{Blobs: blobs, Height: header.Height(), Header: &header.RawHeader}:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No overflow check during catchup sends — inconsistent with live phase.

During live streaming (line 263), the subscription is closed when the buffer is full. But here during catchup, the send is blocking — a slow reader will stall the goroutine indefinitely rather than closing.

This might be intentional (catchup data is already available, so blocking is reasonable), but the behavior difference should be documented in the godoc. Currently the doc says "Not reading from the returned channel during live subscription will cause the stream to close after 16 messages" — add a note that during catchup, sends are blocking.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was intentional because catchup will be much faster than live subscription so I'll document it.

// During the live phase, retrieval errors are retried until successful.
// The channel will be closed when the context is canceled or the service is stopped.
// Not reading from the returned channel during live subscription will cause the stream
// to close after 16 messages.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godoc should document recovery pattern.

The docs explain the overflow behavior but not how users should handle it. Since SubscriptionResponse includes Height, clients can track the last received height and re-subscribe from lastHeight + 1. Worth adding a note like:

// If the channel is closed due to buffer overflow, clients can re-subscribe
// from the last received height + 1 to resume without gaps.

// height is beyond the current network or local head.
func isBeyondTip(err error) bool {
msg := err.Error()
return strings.Contains(msg, "from the future") || strings.Contains(msg, "syncing in progress")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile string matching — will silently break if error messages change.

This is coupled to specific error strings in nodebuilder/header/service.go. If those messages are ever changed, this breaks silently and catchup will treat "beyond tip" as a fatal error.

Ideal fix: export sentinel errors or error types from the header package that can be matched with errors.Is/errors.As.

Minimal fix: add a comment noting the coupling, e.g.:

// NOTE: coupled to error messages in nodebuilder/header/service.go:107,132

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I was going to ask if we wanted to ad an error type directly in the header package(IIRC this is where headerGetter lives)

) bool {
var blobs []*Blob
var err error
for {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tight infinite retry loop with no backoff or logging.

If getAll consistently fails (e.g. network partition), this spins at 100% CPU with no visibility. Consider:

  1. Logging the error (at least at Debug level)
  2. Adding a short backoff (time.After(100ms) or exponential)

This is pre-existing behavior from the original Subscribe, but worth fixing now that it's extracted into a shared helper.

"namespace", ns.ID(), "height", expectedNextHeight, "err", err)
return
}
if !s.fetchAndSendBlobs(ctx, blobCh, gapHeader, ns) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gap-fill sends via fetchAndSendBlobs bypass the buffer overflow check.

The overflow check at line 263 runs once per live header, but this loop can emit multiple gap-fill responses before reaching it. If the gap is large and the reader is slow, the buffer could fill beyond the check. Consider adding an overflow check inside this loop as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overflow check being that we disconnect if buffer is full? I wouldn't disconnect during gap-fill because similar to catch-up the buffer might fill up quickly because of fast writes and slower reads.

@ninabarbakadze ninabarbakadze marked this pull request as draft March 25, 2026 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:api Related to celestia-node API external Issues created by non node team members

Projects

None yet

Development

Successfully merging this pull request may close these issues.

api: start blob sub from height instead of tip of the chain

5 participants