
feat: retry with exponential backoff for DynamoDb interaction #1975

Merged (2 commits) on Jan 4, 2024

Conversation

dispanser (Contributor)

Description

We use an external crate, backoff, to retry DynamoDb read and write operations when the error in the response is `ProvisionedThroughputExceeded`, indicating an overload of DynamoDb with respect to the configured read and write capacity.
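For context, a minimal sketch of the pattern (not the PR's actual code; `DynamoError` and `put_lock_entry` are hypothetical stand-ins for the SDK types, while the 15s/60s values mirror the numbers used in this PR): the DynamoDb call is wrapped in a closure that classifies throughput errors as transient and everything else as permanent, then handed to the backoff crate together with an exponential backoff policy.

```rust
use std::time::Duration;
use backoff::{retry, Error as BackoffError, ExponentialBackoffBuilder};

// Hypothetical stand-in for the DynamoDb SDK error type.
#[derive(Debug)]
enum DynamoError {
    ProvisionedThroughputExceeded(String),
    Other(String),
}

// Hypothetical stand-in for a single DynamoDb write.
fn put_lock_entry() -> Result<(), DynamoError> {
    Err(DynamoError::ProvisionedThroughputExceeded("throttled".into()))
}

fn put_with_retries() -> Result<(), BackoffError<DynamoError>> {
    // Same shape as the policy configured in this PR: cap individual waits
    // at 15s and give up entirely after 60s.
    let policy = ExponentialBackoffBuilder::new()
        .with_max_interval(Duration::from_secs(15))
        .with_max_elapsed_time(Some(Duration::from_secs(60)))
        .build();

    retry(policy, || {
        put_lock_entry().map_err(|e| match e {
            // Throttling is transient: backoff sleeps and tries again.
            DynamoError::ProvisionedThroughputExceeded(_) => BackoffError::transient(e),
            // Anything else is permanent and fails immediately.
            other => BackoffError::permanent(other),
        })
    })
}
```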

@github-actions bot added the binding/rust (Issues for the Rust crate) and crate/core labels on Dec 15, 2023
@dispanser (Contributor, Author)

I decided against using the existing retry logic from crates/deltalake-core/src/data_catalog/client/retry.rs that was suggested by @roeap:

  • with the current crate structure, it's not accessible from the deltalake-aws crate
  • there's not much that can be directly reused; most of the logic resides in impl RetryExt for reqwest::RequestBuilder
  • the model (extension trait RetryExt) isn't really viable because we're retrying four different operations, so we'd be duplicating the actual retry handling (the loop, honoring max_retries, the actual sleeping, etc.) multiple times
  • I think the approach taken by backoff, marking errors as either transient or permanent, is self-documenting and easy to understand (see the sketch below)
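To illustrate the last two points, a hedged sketch (the error enum and operation functions are hypothetical, not the crate's real types): the loop, the sleeping, and the max-elapsed-time accounting all live inside `backoff::retry`, so each of the four DynamoDb operations only needs a small error classifier instead of its own copy of the retry logic.

```rust
use backoff::{retry, Error as BackoffError, ExponentialBackoff};

// Hypothetical error type shared by the stand-in operations below.
#[derive(Debug)]
enum DynamoError {
    ProvisionedThroughputExceeded(String),
    Conflict(String),
}

// Classifying an error is the only per-operation decision: transient errors
// are retried, permanent ones abort immediately.
fn classify(err: DynamoError) -> BackoffError<DynamoError> {
    match err {
        DynamoError::ProvisionedThroughputExceeded(_) => BackoffError::transient(err),
        other => BackoffError::permanent(other),
    }
}

// Hypothetical stand-ins for two of the four retried DynamoDb operations.
fn get_lock_entry() -> Result<String, DynamoError> {
    Ok("entry".into())
}
fn update_lock_entry() -> Result<(), DynamoError> {
    Ok(())
}

fn example() -> Result<(), BackoffError<DynamoError>> {
    // The retry loop, jitter, and elapsed-time bookkeeping live in `retry`,
    // so nothing is duplicated across operations.
    let _entry = retry(ExponentialBackoff::default(), || {
        get_lock_entry().map_err(classify)
    })?;
    retry(ExponentialBackoff::default(), || {
        update_lock_entry().map_err(classify)
    })?;
    Ok(())
}
```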

@rtyler: We're currently only retrying ProvisionedThroughputExceeded, but for most of these error enums there's also the RequestLimitExceeded variant, documented via:

Throughput exceeds the current throughput quota for your account. Please contact AWS Support to request a quota increase.

In theory, this indicates that this error is also transient, but I'm not enough of an AWS expert to know whether it's sensible to handle it as well.

@rtyler (Member) left a comment

I think pulling the backoff crate into deltalake-aws is reasonable. The nice thing about the subcrates is that this new dependency only affects AWS users 😄

Comment on lines 316 to 317
.with_max_interval(Duration::from_secs(15))
.with_max_elapsed_time(Some(Duration::from_secs(60)))
rtyler (Member):

I'm assuming the max elapsed time of 60 seconds is just a number you've picked?

I think this is something that will likely need to be configurable since different systems would have a different tolerance here. Based on how retry is being invoked I don't see a clear path to pulling configuration down into this call 🤔

dispanser (Contributor, Author):

Yes, these numbers would ideally be configurable. It seems other aspects of delta-rs, including the locking provider choice itself, are being configured via environment variables, so I guess that's the way to go.

My question would mostly be whether we want all five possible parameters to be configurable:

  • max elapsed time
  • max interval
  • multiplier
  • initial interval (using defaults in this PR)
  • randomization factor (using defaults in this PR)

The most obvious one is probably the one you singled out, max elapsed time, but there's probably a user out there for any of these :).
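As a rough sketch of the environment-variable route (the variable name below is a placeholder, not necessarily the one this PR introduces): read an override for the max elapsed time and fall back to the 60 seconds currently hard-coded.

```rust
use std::time::Duration;
use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

// Placeholder variable name; the actual name used by the PR may differ.
const MAX_ELAPSED_TIME_ENV: &str = "DYNAMO_LOCK_MAX_ELAPSED_REQUEST_TIME";

fn backoff_policy() -> ExponentialBackoff {
    // Fall back to the 60s default when the variable is unset or is not a
    // valid number of seconds.
    let max_elapsed_secs = std::env::var(MAX_ELAPSED_TIME_ENV)
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(60);

    ExponentialBackoffBuilder::new()
        .with_max_interval(Duration::from_secs(15))
        .with_max_elapsed_time(Some(Duration::from_secs(max_elapsed_secs)))
        .build()
}
```

The same lookup pattern would extend to the other four parameters if they ever need to be tunable.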

rtyler (Member):

@dispanser I agree there are likely folks who would want to tune every single one of these... maybe.

I think an environment variable for max elapsed time is the only one that really needs to be implemented before merging this. The others people may or may not want, but they can make themselves known in the future 😄

dispanser (Contributor, Author):

I added an environment variable for the max elapsed time setting.

I'd address the diverging docs in a separate PR; they are already way off anyway :)

@dispanser force-pushed the dynamodb-retry-exponential-backoff branch from d478153 to 3b4d6d4 on December 18, 2023 09:02
@rtyler self-assigned this on Dec 19, 2023
@dispanser force-pushed the dynamodb-retry-exponential-backoff branch from 3b4d6d4 to fa0bc7d on December 20, 2023 07:03
@dispanser (Contributor, Author)

I think the failing test (macos, restore_by_datetime) is unrelated.

@dispanser force-pushed the dynamodb-retry-exponential-backoff branch 2 times, most recently from 82eeaca to 760b205 on December 30, 2023 11:05
@rtyler (Member) commented on Jan 3, 2024

@dispanser please rebase this on the latest main with the mega refactor 😄

@dispanser force-pushed the dynamodb-retry-exponential-backoff branch from 760b205 to 251dded on January 4, 2024 15:35
@dispanser force-pushed the dynamodb-retry-exponential-backoff branch from 251dded to caa6fb8 on January 4, 2024 15:37
@dispanser (Contributor, Author)

@rtyler rebase is done, all tests pass locally.

@dispanser force-pushed the dynamodb-retry-exponential-backoff branch from caa6fb8 to 21338cd on January 4, 2024 16:53
@rtyler merged commit 9264ede into delta-io:main on Jan 4, 2024
20 checks passed