paymentsdb: restore sql payment store parity with kv#10721
paymentsdb: restore sql payment store parity with kv#10721yyforyongyu merged 9 commits intolightningnetwork:masterfrom
Conversation
SQL writes store the payment identifier in payment_hash for each attempt. That is wrong for AMP payments, because the payment identifier is the SetID while each shard carries its own HTLC hash. Add TestRegisterAttemptPreservesAttemptHash as evidence. It passes on KV and fails on SQL before the fix because SQL reads the attempt hash back as the payment identifier.
Live SQL writes stored the payment identifier in payment_hash for each attempt. That works for legacy payments, but it breaks AMP because the payment identifier is the SetID while each shard carries its own HTLC hash. Use TestRegisterAttemptPreservesAttemptHash as evidence. The test now passes on both KV and SQL. Fix this by persisting attempt.Hash when it is present and only falling back to the payment identifier when the attempt hash is nil. That restores KV parity for AMP attempt reloads.
SQL FetchInFlightPayments only returns payments with an unresolved attempt row. KV returns every non-terminal payment, including retryable payments with only failed attempts and payments that have been initialized but have not registered any HTLCs yet. Add TestFetchInFlightPaymentsIncludesRetryablePayments and TestFetchInFlightPaymentsIncludesInitiatedPayments as evidence. Both tests pass on KV and fail on SQL before the fix.
FetchInFlightPayments needs a dedicated SQL query that can return non-terminal payments without relying on the unresolved-attempt scan. The first version of that query fixed correctness, but the follow-up selector measurements showed a UNION-based shape was materially faster while returning the same payment set. Use the inflight regression tests as evidence. The tests still fail on SQL before the Go payment store is wired up, but this commit adds the final SQL surface the later wiring commit depends on. Add FetchNonTerminalPayments to the SQL query set, regenerate the sqlc bindings, add the PaymentAndIntent adapters for the new row type, and use the UNION-based candidate selection so the final query shape lands in one commit.
The new FetchNonTerminalPayments query is available, but FetchInFlightPayments still uses the old unresolved-attempt scan until this commit. Use the inflight recovery regression tests as evidence. They now pass on both KV and SQL once the payment store is wired up to use the new query. Fix this by switching FetchInFlightPayments to the non-terminal payment query and batch loading only the related attempt and route data for those payment IDs.
FetchInFlightPayments no longer relies on the old FetchAllInflightAttempts query surface once the non-terminal payment query is in place. Use make lint and the inflight recovery tests as evidence. The code still passes once the old query and its generated bindings are removed. Remove FetchAllInflightAttempts from payments.sql, regenerate the sqlc bindings, and drop the matching SQLQueries interface method.
After removing the old FetchAllInflightAttempts query API, the helper types that only existed to batch load that path are no longer used. Use make lint as evidence. It reports the old inflight helper types as unused once the query API is gone. Remove the obsolete helper structs and batch-loading function from the payment SQL store so the remaining code matches the new inflight recovery path.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the SQL payment store's parity with the KV store by addressing issues related to AMP payment hash persistence and startup recovery. It introduces a more efficient query mechanism for identifying non-terminal payments and cleans up legacy code paths that were previously used for inflight attempt tracking. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the payment fetching logic in the SQL store to use a more efficient pagination approach based on non-terminal payments rather than individual HTLC attempts. It introduces a new FetchNonTerminalPayments query that correctly identifies payments that are still in-flight, initiated, or have retryable failures. Additionally, it ensures that HTLC attempt hashes are preserved independently of the payment identifier, which is critical for AMP payments. New unit tests have been added to verify these behaviors. I have no feedback to provide as the identified issue in the review was actually incorrect; rows.Close() is not redundant in this context as it is good practice to close rows explicitly to handle potential errors during the closing process, especially when not using a defer block.
| sessionKey := attempt.SessionKey() | ||
| sessionKeyBytes := sessionKey.Serialize() | ||
| attemptHash := paymentHash[:] | ||
| if attempt.Hash != nil { |
There was a problem hiding this comment.
should we log an error in case the hash is nil normally this should never happen or ?
There was a problem hiding this comment.
Good idea. I added 412db8ae7 to log the nil-hash fallback path so we get a diagnostic if a newly registered attempt ever reaches it unexpectedly. The legacy fallback behavior itself stays the same.
ziggie1984
left a comment
There was a problem hiding this comment.
LGTM
Thank you for the fix. The AMP fix but also the Inflight fix are not super severe since for the hash it is never used for critical operations in the paymentsDB. Moreover the fetching of payments which do not have an unresolved HTLC are also very rare and it would them only leave in an INFLIGHT state but without any potential loss of funds.
| SessionKey: sessionKeyBytes, | ||
| AttemptTime: attempt.AttemptTime, | ||
| PaymentHash: paymentHash[:], | ||
| PaymentHash: attemptHash, |
There was a problem hiding this comment.
Worth noting that while the stored hash was wrong for AMP payments, this did not cause observable payment failures. Settlement is driven by the preimage arriving on the wire, and
failure decryption relies solely on the session key and payment path — neither path consults the stored attempt hash. The bug was a data correctness issue rather than a live payment
breakage, which also explains why it went undetected.
There was a problem hiding this comment.
Agreed. I left the runtime behavior unchanged and treated this as a data-correctness fix rather than a live payment breakage. I only followed up with 412db8ae7 to add a diagnostic on the nil-hash fallback path.
Review: paymentsdb: restore sql payment store parity with kvVerdict: Approve with minor suggestions. What was verified
Impact of the AMP hash bugWorth noting that while the stored hash was wrong for AMP payments, this did not cause observable payment failures. Settlement is driven by the preimage arriving on the wire, and failure decryption relies solely on the session key and payment path — neither path consults the stored attempt hash. The bug was a data correctness issue rather than a live payment breakage, which also explains why it went undetected. Suggestions (non-blocking)1. Silent nil-hash fallback should log an error In attemptHash := paymentHash[:]
if attempt.Hash != nil {
attemptHash = attempt.Hash[:]
} else {
log.Errorf("RegisterAttempt: attempt %d has nil hash, falling "+
"back to payment identifier %v", attempt.AttemptID, paymentHash)
}2. Branch 1 lacks DISTINCT while Branch 3 has it Minor clarity nit: Branch 1 returns one row per unresolved attempt, so a payment with multiple in-flight shards produces duplicate payment_id values. The outer UNION deduplicates them No behaviour change, no performance impact — purely a readability improvement. 3. Docstring is inaccurate after the fix
4. Branch 2 full-table scan (minor, startup-only)
CREATE INDEX IF NOT EXISTS idx_payments_no_fail_reason
ON payments(id) WHERE fail_reason IS NULL;Both SQLite and PostgreSQL support partial indexes. Acceptable as a follow-up since this is startup-only. |
RegisterAttempt falls back to the payment identifier when an attempt hash is nil so legacy data can still round-trip safely. In live router code, however, a nil attempt hash should never happen for newly registered attempts. Add an error log on the fallback path so an unexpected nil attempt hash is surfaced immediately instead of silently persisting the fallback value.
Roasbeef
left a comment
There was a problem hiding this comment.
Great PR! Excellent commit series as well.
LGTM 🐳
Dropped some non-blocking comments re query readability and potentail perf improvements.
sqldb/sqlc/queries/payments.sql
Outdated
|
|
||
| SELECT DISTINCT ha.payment_id AS id | ||
| FROM payment_htlc_attempts ha | ||
| JOIN payment_htlc_attempt_resolutions hr |
There was a problem hiding this comment.
Given that this clause and the 1st both join on attempt resolutions, I think we can collapse the query to instead be:
SELECT p.id FROM payments p
WHERE p.fail_reason IS NULL
AND NOT EXISTS (
SELECT 1 FROM payment_htlc_attempt_resolutions hr
JOIN payment_htlc_attempts ha ON ha.attempt_index = hr.attempt_index
WHERE ha.payment_id = p.id AND hr.resolution_type = 1
)
UNION
SELECT ha.payment_id FROM payment_htlc_attempts ha
WHERE NOT EXISTS (
SELECT 1 FROM payment_htlc_attempt_resolutions hr
WHERE hr.attempt_index = ha.attempt_index
)(not tested)
There was a problem hiding this comment.
So it queries that: not failed yet and not settled attempted + has an unresolved attempt
|
|
||
| // Convert map to slice and sort by sequence number to | ||
| // produce a deterministic ordering. | ||
| mpPayments = make([]*MPPayment, 0, len(processedPayments)) |
There was a problem hiding this comment.
We no longer need this sorting step?
There was a problem hiding this comment.
yea we no longer need the old Go-side sorting step. The current path appends one payment per selector row, and the selector query already returns rows ordered by p.id ASC
| AND NOT EXISTS ( | ||
| SELECT 1 FROM payment_htlc_attempts ha | ||
| WHERE ha.payment_id = p.id | ||
| ) |
There was a problem hiding this comment.
Can consider adding an index like this to speed up this portion of the query:
CREATE INDEX IF NOT EXISTS idx_payments_no_fail_reason ON payments(id) WHERE fail_reason IS NULL;Otherwise will scan, but iiuc this only is done on start up
| if attempt.Hash != nil { | ||
| attemptHash = attempt.Hash[:] | ||
| } else { | ||
| log.Errorf("RegisterAttempt: attempt %d has nil hash, "+ |
There was a problem hiding this comment.
Should this return an error instead?
There was a problem hiding this comment.
Think we need it to keep the current behavior - KV already tolerates attempt.Hash == nil, and changing this to an error would be a behavior change rather than just a bug fix. We could revisit it tho, if we did see an error log.
Simplify FetchNonTerminalPayments by collapsing the selector down to two branches: payments that are not failed and have no settled attempt, and payments that still have unresolved attempts. This keeps the same non-terminal semantics while making the query easier to reason about. Also add a partial index on payments(id) where fail_reason IS NULL to speed up the startup selector branch that scans payments without a recorded failure reason.
🔴 PR Severity: CRITICAL
🔴 Critical (2 files)
🟠 High (3 files)
🟡 Medium (1 file)
AnalysisThis PR adds a new SQL database migration ( The remaining changes update the SQL query layer ( This PR warrants expert review focused on:
To override, add a |
Issues Found
HTLC attempts, so AMP attempts could reload with the wrong hash after a
restart.
non-terminal payments that KV already exposed through
FetchInFlightPayments.the correctness fix would not devolve into a full history scan.
What This PR Does
current KV behavior
UNION-based selector to improve the non-terminal payment queryTesting
make lintGOWORK=off go test ./payments/db ./routingGOWORK=off go test -tags=test_db_sqlite ./payments/dbGOWORK=off go test -tags=test_db_postgres ./payments/db