Skip to content

paymentsdb: restore sql payment store parity with kv#10721

Merged
yyforyongyu merged 9 commits intolightningnetwork:masterfrom
yyforyongyu:fix-paymentdb
Apr 9, 2026
Merged

paymentsdb: restore sql payment store parity with kv#10721
yyforyongyu merged 9 commits intolightningnetwork:masterfrom
yyforyongyu:fix-paymentdb

Conversation

@yyforyongyu
Copy link
Copy Markdown
Member

Issues Found

  • SQL stored the payment identifier instead of the per-attempt hash for AMP
    HTLC attempts, so AMP attempts could reload with the wrong hash after a
    restart.
  • SQL startup recovery only considered unresolved attempts, which missed some
    non-terminal payments that KV already exposed through
    FetchInFlightPayments.
  • The non-terminal payment selector also needed a more efficient query shape so
    the correctness fix would not devolve into a full history scan.

What This PR Does

  • persist the correct AMP attempt hash in the SQL store
  • fetch non-terminal payments directly for SQL startup recovery, matching the
    current KV behavior
  • use a UNION-based selector to improve the non-terminal payment query
  • remove the old inflight-only SQL query path and its dead helper code

Testing

  • make lint
  • GOWORK=off go test ./payments/db ./routing
  • GOWORK=off go test -tags=test_db_sqlite ./payments/db
  • GOWORK=off go test -tags=test_db_postgres ./payments/db
    • blocked in the default environment without Docker socket access

SQL writes store the payment identifier in payment_hash for each
attempt. That is wrong for AMP payments, because the payment
identifier is the SetID while each shard carries its own HTLC hash.

Add TestRegisterAttemptPreservesAttemptHash as evidence. It passes on
KV and fails on SQL before the fix because SQL reads the attempt hash
back as the payment identifier.
Live SQL writes stored the payment identifier in payment_hash for
each attempt. That works for legacy payments, but it breaks AMP
because the payment identifier is the SetID while each shard carries
its own HTLC hash.

Use TestRegisterAttemptPreservesAttemptHash as evidence. The test now
passes on both KV and SQL.

Fix this by persisting attempt.Hash when it is present and only
falling back to the payment identifier when the attempt hash is nil.
That restores KV parity for AMP attempt reloads.
SQL FetchInFlightPayments only returns payments with an unresolved
attempt row. KV returns every non-terminal payment, including
retryable payments with only failed attempts and payments that have
been initialized but have not registered any HTLCs yet.

Add TestFetchInFlightPaymentsIncludesRetryablePayments and
TestFetchInFlightPaymentsIncludesInitiatedPayments as evidence. Both
tests pass on KV and fail on SQL before the fix.
FetchInFlightPayments needs a dedicated SQL query that can return
non-terminal payments without relying on the unresolved-attempt scan.
The first version of that query fixed correctness, but the follow-up
selector measurements showed a UNION-based shape was materially
faster while returning the same payment set.

Use the inflight regression tests as evidence. The tests still fail on
SQL before the Go payment store is wired up, but this commit adds the
final SQL surface the later wiring commit depends on.

Add FetchNonTerminalPayments to the SQL query set, regenerate the sqlc
bindings, add the PaymentAndIntent adapters for the new row type, and
use the UNION-based candidate selection so the final query shape lands
in one commit.
The new FetchNonTerminalPayments query is available, but
FetchInFlightPayments still uses the old unresolved-attempt scan until
this commit.

Use the inflight recovery regression tests as evidence. They now pass
on both KV and SQL once the payment store is wired up to use the new
query.

Fix this by switching FetchInFlightPayments to the non-terminal
payment query and batch loading only the related attempt and route
data for those payment IDs.
FetchInFlightPayments no longer relies on the old
FetchAllInflightAttempts query surface once the non-terminal payment
query is in place.

Use make lint and the inflight recovery tests as evidence. The code
still passes once the old query and its generated bindings are removed.

Remove FetchAllInflightAttempts from payments.sql, regenerate the sqlc
bindings, and drop the matching SQLQueries interface method.
After removing the old FetchAllInflightAttempts query API, the helper
types that only existed to batch load that path are no longer used.

Use make lint as evidence. It reports the old inflight helper types as
unused once the query API is gone.

Remove the obsolete helper structs and batch-loading function from the
payment SQL store so the remaining code matches the new inflight
recovery path.
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the SQL payment store's parity with the KV store by addressing issues related to AMP payment hash persistence and startup recovery. It introduces a more efficient query mechanism for identifying non-terminal payments and cleans up legacy code paths that were previously used for inflight attempt tracking.

Highlights

  • AMP Payment Hash Persistence: Updated the SQL store to correctly persist the individual HTLC attempt hash instead of the payment identifier, ensuring AMP payments reload with the correct hash after a restart.
  • Improved Startup Recovery: Replaced the inflight-only query path with a more robust FetchNonTerminalPayments query, ensuring all non-terminal payments are correctly exposed during startup recovery.
  • Query Optimization: Implemented a UNION-based selector for non-terminal payments to improve query efficiency and prevent full history scans.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the payment fetching logic in the SQL store to use a more efficient pagination approach based on non-terminal payments rather than individual HTLC attempts. It introduces a new FetchNonTerminalPayments query that correctly identifies payments that are still in-flight, initiated, or have retryable failures. Additionally, it ensures that HTLC attempt hashes are preserved independently of the payment identifier, which is critical for AMP payments. New unit tests have been added to verify these behaviors. I have no feedback to provide as the identified issue in the review was actually incorrect; rows.Close() is not redundant in this context as it is good practice to close rows explicitly to handle potential errors during the closing process, especially when not using a defer block.

@yyforyongyu yyforyongyu marked this pull request as ready for review April 7, 2026 16:16
@saubyk saubyk requested a review from ziggie1984 April 7, 2026 16:41
@saubyk saubyk added this to v0.21 Apr 7, 2026
@saubyk saubyk moved this to In progress in v0.21 Apr 7, 2026
sessionKey := attempt.SessionKey()
sessionKeyBytes := sessionKey.Serialize()
attemptHash := paymentHash[:]
if attempt.Hash != nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we log an error in case the hash is nil normally this should never happen or ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I added 412db8ae7 to log the nil-hash fallback path so we get a diagnostic if a newly registered attempt ever reaches it unexpectedly. The legacy fallback behavior itself stays the same.

Copy link
Copy Markdown
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you for the fix. The AMP fix but also the Inflight fix are not super severe since for the hash it is never used for critical operations in the paymentsDB. Moreover the fetching of payments which do not have an unresolved HTLC are also very rare and it would them only leave in an INFLIGHT state but without any potential loss of funds.

SessionKey: sessionKeyBytes,
AttemptTime: attempt.AttemptTime,
PaymentHash: paymentHash[:],
PaymentHash: attemptHash,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that while the stored hash was wrong for AMP payments, this did not cause observable payment failures. Settlement is driven by the preimage arriving on the wire, and
failure decryption relies solely on the session key and payment path — neither path consults the stored attempt hash. The bug was a data correctness issue rather than a live payment
breakage, which also explains why it went undetected.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I left the runtime behavior unchanged and treated this as a data-correctness fix rather than a live payment breakage. I only followed up with 412db8ae7 to add a diagnostic on the nil-hash fallback path.

@ziggie1984
Copy link
Copy Markdown
Collaborator

ziggie1984 commented Apr 7, 2026

Note: This review was generated by a Claude Code AI review session (/code-review), based on deep static analysis of the diff, schema, index definitions, and Go call paths. No automated tooling was substituted for manual reasoning — all conclusions were derived by tracing the code.


Review: paymentsdb: restore sql payment store parity with kv

Verdict: Approve with minor suggestions.

What was verified

  • All new regression tests pass on both the KV and SQLite backends
  • The three UNION branches are mutually exclusive across payments and correctly map to all three non-terminal states in decidePaymentStatus
  • ORDER BY p.id ASC correctly replaces the old sort.Slice by SequenceNum since p.id == SequenceNum (sql_store.go:636)
  • The AMP hash fix is correct; attempt.Hash is never nil in live router code — only in KV→SQL migration of legacy attempts, which is handled separately in migration1/sql_migration.go:434

Impact of the AMP hash bug

Worth noting that while the stored hash was wrong for AMP payments, this did not cause observable payment failures. Settlement is driven by the preimage arriving on the wire, and failure decryption relies solely on the session key and payment path — neither path consults the stored attempt hash. The bug was a data correctness issue rather than a live payment breakage, which also explains why it went undetected.

Suggestions (non-blocking)

1. Silent nil-hash fallback should log an error

In sql_store.go:1477–1480, the nil branch falls back silently with no diagnostic output. Since this path should never be reached in live router code, hitting it indicates a programming error. A log.Errorf would surface it immediately rather than silently persisting wrong data:

attemptHash := paymentHash[:]
if attempt.Hash != nil {
    attemptHash = attempt.Hash[:]
} else {
    log.Errorf("RegisterAttempt: attempt %d has nil hash, falling "+
        "back to payment identifier %v", attempt.AttemptID, paymentHash)
}

2. Branch 1 lacks DISTINCT while Branch 3 has it

Minor clarity nit: Branch 1 returns one row per unresolved attempt, so a payment with multiple in-flight shards produces duplicate payment_id values. The outer UNION deduplicates them
correctly, but Branch 3 already uses SELECT DISTINCT for the same reason. Making Branch 1 consistent and switching to UNION ALL makes the invariant self-documenting — each branch is
responsible for its own deduplication, and UNION ALL signals that the caller knows the branches can never overlap:

   SELECT DISTINCT ha.payment_id AS id   -- add DISTINCT
   FROM payment_htlc_attempts ha
   WHERE NOT EXISTS (...)

  UNION ALL                              -- change from UNION

   SELECT p.id
   FROM payments p
   WHERE ...

  UNION ALL                              -- change from UNION

   SELECT DISTINCT ha.payment_id AS id   -- already present
   FROM payment_htlc_attempts ha
   ...

No behaviour change, no performance impact — purely a readability improvement.

3. Docstring is inaccurate after the fix

FetchInFlightPayments and the PaymentReader interface both still describe the function as returning "payments with status InFlight". The function now also returns StatusInitiated payments (Branch 2) and retryable payments. The description should say "non-terminal payments" and enumerate all three cases.

4. Branch 2 full-table scan (minor, startup-only)

payments WHERE fail_reason IS NULL has no index. A partial index would help on large nodes:

CREATE INDEX IF NOT EXISTS idx_payments_no_fail_reason
ON payments(id) WHERE fail_reason IS NULL;

Both SQLite and PostgreSQL support partial indexes. Acceptable as a follow-up since this is startup-only.

@ziggie1984 ziggie1984 moved this from In progress to In review in v0.21 Apr 8, 2026
RegisterAttempt falls back to the payment identifier when an attempt
hash is nil so legacy data can still round-trip safely. In live router
code, however, a nil attempt hash should never happen for newly
registered attempts.

Add an error log on the fallback path so an unexpected nil attempt hash
is surfaced immediately instead of silently persisting the fallback
value.
Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! Excellent commit series as well.

LGTM 🐳

Dropped some non-blocking comments re query readability and potentail perf improvements.


SELECT DISTINCT ha.payment_id AS id
FROM payment_htlc_attempts ha
JOIN payment_htlc_attempt_resolutions hr
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this clause and the 1st both join on attempt resolutions, I think we can collapse the query to instead be:

  SELECT p.id FROM payments p
  WHERE p.fail_reason IS NULL
    AND NOT EXISTS (  
      SELECT 1 FROM payment_htlc_attempt_resolutions hr
      JOIN payment_htlc_attempts ha ON ha.attempt_index = hr.attempt_index
      WHERE ha.payment_id = p.id AND hr.resolution_type = 1
    )
  UNION
  SELECT ha.payment_id FROM payment_htlc_attempts ha
  WHERE NOT EXISTS (
      SELECT 1 FROM payment_htlc_attempt_resolutions hr
      WHERE hr.attempt_index = ha.attempt_index
  )

(not tested)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it queries that: not failed yet and not settled attempted + has an unresolved attempt

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool applied


// Convert map to slice and sort by sequence number to
// produce a deterministic ordering.
mpPayments = make([]*MPPayment, 0, len(processedPayments))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need this sorting step?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we no longer need the old Go-side sorting step. The current path appends one payment per selector row, and the selector query already returns rows ordered by p.id ASC

AND NOT EXISTS (
SELECT 1 FROM payment_htlc_attempts ha
WHERE ha.payment_id = p.id
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can consider adding an index like this to speed up this portion of the query:

CREATE INDEX IF NOT EXISTS idx_payments_no_fail_reason ON payments(id) WHERE fail_reason IS NULL;

Otherwise will scan, but iiuc this only is done on start up

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

if attempt.Hash != nil {
attemptHash = attempt.Hash[:]
} else {
log.Errorf("RegisterAttempt: attempt %d has nil hash, "+
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return an error instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we need it to keep the current behavior - KV already tolerates attempt.Hash == nil, and changing this to an error would be a behavior change rather than just a bug fix. We could revisit it tho, if we did see an error log.

Simplify FetchNonTerminalPayments by collapsing the selector down to
two branches: payments that are not failed and have no settled attempt,
and payments that still have unresolved attempts. This keeps the same
non-terminal semantics while making the query easier to reason about.

Also add a partial index on payments(id) where fail_reason IS NULL to
speed up the startup selector branch that scans payments without a
recorded failure reason.
@github-actions github-actions bot added the severity-critical Requires expert review - security/consensus critical label Apr 8, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 8, 2026

🔴 PR Severity: CRITICAL

Automated classification | 6 files (excl. tests/generated) | 279 lines changed

🔴 Critical (2 files)
  • sqldb/sqlc/migrations/000014_payments_no_fail_reason_index.down.sql - SQL database migration (sqldb/* migrations are always CRITICAL)
  • sqldb/sqlc/migrations/000014_payments_no_fail_reason_index.up.sql - SQL database migration (sqldb/* migrations are always CRITICAL)
🟠 High (3 files)
  • sqldb/sqlc/db_custom.go - sqldb/* package (HIGH)
  • sqldb/sqlc/querier.go - sqldb/* package (HIGH)
  • sqldb/sqlc/queries/payments.sql - sqldb/* package (HIGH)
🟡 Medium (1 file)
  • payments/db/sql_store.go - payments/* package (MEDIUM)

Analysis

This PR adds a new SQL database migration (000014_payments_no_fail_reason_index) that modifies the payments table schema — specifically adding an index for payments without a fail reason. Database migrations in sqldb/* are always classified as CRITICAL regardless of other factors, because schema changes are irreversible and can affect data integrity and query performance in production.

The remaining changes update the SQL query layer (payments.sql.go, querier.go, db_custom.go) to support the new index, and payments/db/sql_store.go is updated to use the new query interface. These are HIGH/MEDIUM on their own, but the migration drives the overall classification to CRITICAL.

This PR warrants expert review focused on:

  • Correctness and safety of the migration (up and down paths)
  • Index definition and its impact on query performance
  • Compatibility with existing payment store logic

To override, add a severity-override-{critical,high,medium,low} label.

@yyforyongyu yyforyongyu merged commit 9bf4f50 into lightningnetwork:master Apr 9, 2026
52 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in v0.21 Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug fix no-changelog severity-critical Requires expert review - security/consensus critical

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants