routing+htlcswitch: fix stuck inflight payments #9150

Open: wants to merge 8 commits into base: master

yyforyongyu
Collaborator

@yyforyongyu yyforyongyu commented Oct 2, 2024

Fix #8975

This PR aims to fix a scenario where a payment could get stuck. The major changes are:

  • all the db updates for a given payment now happen in a single goroutine (resumePayment)
  • a minor optimization on sphinx circuit creation: the error decryptor is now created only when needed, i.e., when processing an update_fail_htlc.
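
As a rough illustration of the second bullet, the sketch below builds a decryptor only on the failure path. It is not the lnd implementation: `attempt`, `result`, `errorDecrypter`, `newDecrypter`, and `handleResult` are hypothetical, simplified names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errorDecrypter is a hypothetical stand-in for the switch's error decryptor.
type errorDecrypter struct{ sessionKey string }

// decrypt pretends to decode a failure blob; no real cryptography happens here.
func (d *errorDecrypter) decrypt(blob []byte) string {
	return fmt.Sprintf("decrypted(%s) with %s", blob, d.sessionKey)
}

// attempt is a hypothetical, simplified HTLC attempt.
type attempt struct {
	sessionKey string
	built      int // counts how many times a decryptor was constructed
}

// newDecrypter models the comparatively expensive circuit construction.
func (a *attempt) newDecrypter() (*errorDecrypter, error) {
	if a.sessionKey == "" {
		return nil, errors.New("missing session key")
	}
	a.built++
	return &errorDecrypter{sessionKey: a.sessionKey}, nil
}

// result is a hypothetical switch result: either a settle or a failure blob.
type result struct {
	settled  bool
	failBlob []byte
}

// handleResult constructs the decryptor only when the HTLC failed.
func handleResult(a *attempt, r result) (string, error) {
	if r.settled {
		// Settle path: no decryptor is needed, so none is built.
		return "settled", nil
	}
	d, err := a.newDecrypter()
	if err != nil {
		return "", err
	}
	return d.decrypt(r.failBlob), nil
}

func main() {
	a := &attempt{sessionKey: "k1"}

	out, err := handleResult(a, result{settled: true})
	fmt.Println(out, err, "decryptors built:", a.built)

	out, err = handleResult(a, result{failBlob: []byte("blob")})
	fmt.Println(out, err, "decryptors built:", a.built)
}
```

The counter exists only to make the effect visible: it stays at zero for a settle and increments once for a fail.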

TODOs

  • fix unit test in htlcswitch
  • fix unit test in routing

Contributor

coderabbitai bot commented Oct 2, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



Contributor

@calvinrzachman calvinrzachman left a comment

Hi YY, made a quick pass through the change set. Looks pretty cool to me! I had a couple of questions to make sure I understand the approach, and about whether the modification of GetAttemptResult is strictly necessary for this bug fix, as it may have implications (albeit slight) for how the ChannelRouter can be used remotely.

@@ -123,7 +123,8 @@ type PaymentAttemptDispatcher interface {
 	// longer be in flight. The switch shutting down is signaled by
 	// closing the channel. If the attemptID is unknown,
 	// ErrPaymentIDNotFound will be returned.
-	GetAttemptResult(attemptID uint64, paymentHash lntypes.Hash,
+	GetAttemptResult(attempt *channeldb.HTLCAttempt,

Contributor

IIUC, this PR moves to transporting the hop + session key information across the interface boundary by way of the channeldb.HTLCAttempt type. This is what allows the forwarding error decryption to be performed one layer deeper, within the Switch rather than within the ChannelRouter. This may have implications for the ability to support oblivious payments by offering a choice between server-side and client-side decryption of forwarding errors.

Maybe this is too forward-looking to be relevant, but such an opportunity is a bit more natural when you consider alternative deployment scenarios in which ChannelRouter and Switch run remotely in different processes: perhaps both controlled by the same entity (e.g., a gRPC reverse proxy making multiple nodes act as a single logical node), or controlled by different parties, in the case where the client would like to build onions and send payments in a way that is private from a hosted node infrastructure provider.

Whether forwarding error decryption should be conducted server/Switch side or by the caller of GetAttemptResult could previously be signaled by the presence of the deobfuscator:

// extractResult uses the given deobfuscator to extract the payment result from
// the given network message.
func (s *Switch) extractResult(deobfuscator ErrorDecrypter, n *networkResult,
	attemptID uint64, paymentHash lntypes.Hash) (*PaymentResult, error) {

...

// If the caller did not provide a deobfuscator, then we'll
// return the onion-encrypted blob that details why the HTLC was
// failed. This blob is only fully decryptable by the entity
// which built the onion packet.
if deobfuscator == nil {
	return &PaymentResult{
		EncryptedError: htlc.Reason,
	}, nil
}

I could imagine a similar approach to optional Switch error decryption by constructing an HTLCAttempt object which does not set the fields needed to perform forwarding error decryption in the Switch, but the attempt.Circuit() call would have to tolerate that information being absent so the Switch could instead return the encrypted error blob via the result.

In our case, we control both the remote ChannelRouter and the lnd/Switch instances, so whether the forwarding error decryption occurs in the router or switch isn't a huge concern since privacy across the RPC boundary is not relevant.

From the perspective of a router client controlled separately from the entity operating the lnd/Switch instances: With the removal of any ChannelRouter error decryption logic, the client application could not re-use the ChannelRouter type, but it could at least in principle provide a different implementation of a path-finding, onion building, and payment life-cycle manager component that could do client-side decryption of forwarding errors.
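
The optional-decryption idea discussed in this comment can be sketched in isolation. The code below is an assumption-laden toy, not lnd's API: `errorDecrypter`, `paymentResult`, `extractFailure`, and `prefixDecrypter` are hypothetical names, and the "decryption" is only a prefix strip used to keep the example runnable.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errorDecrypter is a hypothetical, simplified decryptor interface.
type errorDecrypter interface {
	decrypt(reason []byte) (string, error)
}

// paymentResult is a simplified failure result; one of the fields is set.
type paymentResult struct {
	encryptedError []byte // returned untouched when no decrypter is supplied
	failure        string // decoded failure when a decrypter is supplied
}

// extractFailure decrypts the failure reason if a decrypter is given.
// Otherwise it hands the opaque blob back so the caller can decrypt it.
func extractFailure(d errorDecrypter, reason []byte) (*paymentResult, error) {
	if d == nil {
		return &paymentResult{encryptedError: reason}, nil
	}
	msg, err := d.decrypt(reason)
	if err != nil {
		return nil, err
	}
	return &paymentResult{failure: msg}, nil
}

// prefixDecrypter is a toy decrypter that strips a fixed prefix. It is not
// real onion error decryption.
type prefixDecrypter struct{ prefix string }

func (p prefixDecrypter) decrypt(reason []byte) (string, error) {
	s := string(reason)
	if !strings.HasPrefix(s, p.prefix) {
		return "", errors.New("cannot decrypt failure reason")
	}
	return strings.TrimPrefix(s, p.prefix), nil
}

func main() {
	blob := []byte("enc:temporary_channel_failure")

	// "Client-side" mode: no decrypter, so the blob comes back as-is.
	r, _ := extractFailure(nil, blob)
	fmt.Printf("encrypted=%q failure=%q\n", r.encryptedError, r.failure)

	// "Server-side" mode: the callee decrypts on the caller's behalf.
	r, _ = extractFailure(prefixDecrypter{prefix: "enc:"}, blob)
	fmt.Printf("encrypted=%q failure=%q\n", r.encryptedError, r.failure)
}
```

Passing a literal nil selects the pass-through mode; a typed nil pointer stored in the interface would not compare equal to nil, so callers of such an API would need to pass an untyped nil.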

Contributor

Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?

Maybe we like the improvements this brings better than the currently somewhat hypothetical ability to allow ChannelRouter re-use in clients seeking to maintain privacy from hosted node providers. One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.

Collaborator Author

Not sure I followed - if you wanna create onions then you should use SendToOnions?

> Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?

Nope it's not needed but an improvement - an easy way to measure it is to run the payment benchmark, which increases a bit. This is more of a logical code separation since there's no need to re-create the error decryptor if we know it's not a fail but a settle.

> One argument for keeping the error decryption logic in the ChannelRouter could be that it does create the onion after all so will still need to handle or know about the hops + session key.

I don't think that's the case, all paymentLifecycle needs is to call RequestRoute and pass the route to the htlcswitch.

Anyway I removed this change to limit the scope of this PR, think we'll take a look at it again when working on #8834.

result *htlcswitch.PaymentResult) bool {

// Save the keys so we know which items to delete from the map.
attempts = append(attempts, a)

Contributor

Should we only mark an attempt result for deletion if it is actually successfully handled without error by the call to handleAttemptResult below? If a result is not deleted from the map, it will be reprocessed when this function is again called on the next life-cycle loop iteration. On one hand retries sound good, but in this context maybe that risks a perpetually growing result map if certain results are borked and never able to be processed successfully 🤔

Collaborator Author

yeah I thought about this too - since the only error we get here is DB-related, I don't think retrying it again would help. This is also the current behavior. But I guess some logs would help so will add!

// Once the result is collected, we signal it by writing the
// error to `resultCollected`.
// Save the result and process it in the next main loop.
p.switchResults.Store(attempt, result)

Contributor

Rather than update payment state in DB given an attempt result from collectResult, we instead defer any ControlTower/DB updates and simply collect and return the result so that we can update our DB state later from a centralized location. Am I following you correctly?

Collaborator Author

yes exactly!
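
The pattern confirmed here, where collectors only store results and a single loop applies them, can be sketched as follows. This is a simplified model under stated assumptions: the `lifecycle` struct, its `db` map (standing in for the ControlTower/DB), and the `fetch` callback are hypothetical. Only the method names and the `switchResults` field echo identifiers mentioned in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycle is a hypothetical, stripped-down payment lifecycle.
type lifecycle struct {
	switchResults sync.Map          // attempt ID -> result, written by collectors
	wg            sync.WaitGroup    // tracks collector goroutines
	db            map[uint64]string // stand-in for the payment DB; main loop only
}

// collectResultAsync waits for a result in its own goroutine and only stores
// it. It never touches the DB.
func (l *lifecycle) collectResultAsync(id uint64, fetch func(uint64) string) {
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()

		// Save the result and process it in the next main loop.
		l.switchResults.Store(id, fetch(id))
	}()
}

// processSwitchResults runs in the main loop goroutine, which is the only
// place where the DB stand-in is written. It returns the number of results
// applied.
func (l *lifecycle) processSwitchResults() int {
	n := 0
	l.switchResults.Range(func(k, v any) bool {
		l.db[k.(uint64)] = v.(string)
		l.switchResults.Delete(k)
		n++
		return true
	})
	return n
}

func main() {
	l := &lifecycle{db: make(map[uint64]string)}
	fetch := func(id uint64) string { return fmt.Sprintf("result-%d", id) }

	for id := uint64(1); id <= 3; id++ {
		l.collectResultAsync(id, fetch)
	}

	// A real loop would poll or be signaled; waiting keeps the demo simple.
	l.wg.Wait()

	fmt.Println("processed:", l.processSwitchResults(), "db entries:", len(l.db))
}
```

Because the plain `db` map is written from one goroutine only, it needs no lock of its own, which is the consistency property the refactor is after.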

// means the payment lifecycle needs to be terminated.
_, err := p.handleAttemptResult(a, result)
if err != nil {
errReturned = err

Contributor

Is there any issue with masking errors of payment attempt results if there are multiple results in this map with an error?

Collaborator Author

Nice catch - imo it's fine to do so here as all we want is to catch at least one of the DB errors and abort the payment lifecycle.

payment.GetStatus(), ps.NumAttemptsInFlight,
ps.RemainingAmt, remainingFees)
// We update the payment state on every iteration.
currentPayment, ps, err := p.refreshPayment()

Contributor

Do we need to handle any potential error from this call?

Collaborator Author

yep

@ziggie1984 ziggie1984 self-requested a review October 14, 2024 06:34
@yyforyongyu yyforyongyu force-pushed the fix-stuck-payment branch 2 times, most recently from abd6e94 to 930fadf Compare October 14, 2024 12:35
@yyforyongyu yyforyongyu added the labels payments (Related to invoices/payments), bug fix, and size/kilo (medium, proper context needed, less than 1000 lines) Oct 14, 2024
@yyforyongyu yyforyongyu added this to the v0.19.0 milestone Oct 14, 2024
@yyforyongyu yyforyongyu self-assigned this Oct 14, 2024
@yyforyongyu yyforyongyu marked this pull request as ready for review October 14, 2024 12:37

This commit caches the sphinx circuit and onion blob to avoid re-creating them.

To shorten the method `resumePayment` and make each step clearer.

To further shorten the lifecycle loop.

This commit refactors `collectResultAsync` such that this method is now only responsible for collecting results from the switch. A new method `processSwitchResults` is added to process these results in the same goroutine where we fetch the payment from db, to make sure the lifecycle loop always has a consistent view of a given payment.

Labels
bug fix; payments (Related to invoices/payments); size/kilo (medium, proper context needed, less than 1000 lines)
Development

Successfully merging this pull request may close these issues.

[bug]: Payment Flow: Don't abort payment until HTLC attempts are still in flight
2 participants