Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sdk)!: detect stale nodes #2254

Merged
merged 16 commits into from
Oct 19, 2024
Merged

feat(sdk)!: detect stale nodes #2254

merged 16 commits into from
Oct 19, 2024

Conversation

lklimek
Copy link
Contributor

@lklimek lklimek commented Oct 18, 2024

Issue being fixed or feature implemented

When fetching information from a node that has chain halt, we receive outdated information.

What was done?

Implemented two mechanisms to detect stale nodes:

  1. Ensure height of received response is not older than the last seen height minus some tolerance (defaults to 1).
  2. Ensure time of received response is within some time window (mechanism disabled by default).
  3. Added error StaleNodeError to Sdk that is emitted by Sdk if the conditions above are not met.
  4. Added CanRetry trait to dash_sdk::Error that will return true is the request should be retried

How Has This Been Tested?

Added tests.

Breaking Changes

Queries to nodes that are out of sync return StaleNodeError instead of outdated information.

rs_dapi_client::CanRetry::is_node_failure() is deprecated in favor of can_retry().

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

Summary by CodeRabbit

  • New Features

    • Enhanced error handling with a new StaleNodeError type for detailed reporting of stale metadata issues.
    • Added retry logic improvements in the DAPI client, allowing for more nuanced error handling.
    • Introduced new fields in the SDK for managing proof staleness based on height and time.
    • Added methods for verifying proof metadata against defined tolerances.
  • Bug Fixes

    • Updated logic for determining retry capabilities based on error types, improving reliability during DAPI requests.
  • Documentation

    • Improved clarity in documentation comments related to the new retry logic and error handling.
  • Tests

    • Added unit tests for new verification functions to ensure proof validity checks work as intended.

Copy link
Contributor

coderabbitai bot commented Oct 18, 2024

Walkthrough

The changes introduced in this pull request primarily focus on enhancing error handling and retry logic within the DAPI client and SDK. Key modifications include the replacement of the is_node_failure method with the can_retry method, which returns a bool, allowing for more nuanced error categorization. Additionally, new error types related to stale nodes have been defined, and the SDK has been updated to incorporate fields and methods for managing metadata staleness. These changes collectively improve the flexibility and clarity of error management in the system.

Changes

File Path Change Summary
packages/rs-dapi-client/src/dapi_client.rs Updated DapiClientError enum with can_retry method; modified error handling in execute.
packages/rs-dapi-client/src/lib.rs Added can_retry method to CanRetry trait; deprecated is_node_failure.
packages/rs-dapi-client/src/transport/grpc.rs Replaced is_node_failure with can_retry in dapi_grpc::tonic::Status implementation.
packages/rs-sdk/src/error.rs Added CanRetry implementation for Error enum; introduced StaleNodeError enum.
packages/rs-sdk/src/sdk.rs Added fields for metadata height and time tolerances; updated methods for metadata verification.

Possibly related PRs

  • feat(sdk): fix client tls connections #2223: The changes in this PR involve modifications to the dapi_client.rs file, specifically enhancing error handling in the execute method, which is directly related to the updates made to the DapiClientError enum and its can_retry method in the main PR.

Suggested reviewers

  • QuantumExplorer
  • pshenmic

🐰 In the garden where errors once bloomed,
A new path of retries has now been groomed.
With staleness defined and logic refined,
Our DAPI client’s future is brightly aligned!
So hop along, let’s code and create,
For every small change, we celebrate! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (9)
packages/rs-dapi-client/src/lib.rs (1)

74-88: Approved: Excellent improvements to error handling and retry logic

The changes to the CanRetry trait are well-implemented and align perfectly with the PR objectives. The new can_retry method provides more flexibility in error handling, while the deprecation of is_node_failure maintains backward compatibility.

A minor suggestion for improvement:

Consider adding a brief example in the documentation for can_retry to illustrate its usage, especially how to handle the None case. This would further enhance the clarity for developers using this trait.

Example:

/// Returns true if the operation can be retried.
/// `None` means unspecified - in this case, you should either inspect the error
/// in more details, or assume `false`.
///
/// # Example
/// ```
/// match error.can_retry() {
///     Some(true) => retry_operation(),
///     Some(false) => handle_non_retryable_error(),
///     None => {
///         // Inspect error in more detail or assume false
///         if should_retry_based_on_error_details(error) {
///             retry_operation()
///         } else {
///             handle_non_retryable_error()
///         }
///     }
/// }
/// ```
fn can_retry(&self) -> Option<bool>;
packages/rs-sdk/src/error.rs (1)

84-100: LGTM: CanRetry trait implementation looks good.

The implementation correctly identifies retryable errors as per the PR objectives, covering stale proofs, DAPI client errors, core client errors, and timeout errors. The use of the matches! macro results in concise and readable code.

Consider adding a comment explaining the rationale behind categorizing these specific errors as retryable. This would enhance code maintainability and make it easier for other developers to understand and potentially extend this logic in the future.

Example:

impl CanRetry for Error {
    fn can_retry(&self) -> Option<bool> {
        // The following errors are considered retryable:
        // - Stale proofs: The node might have updated data on a subsequent request
        // - DAPI and Core client errors: These might be transient network issues
        // - Timeout errors: A retry might succeed if the timeout was due to temporary congestion
        let retry = matches!(
            self,
            Error::Proof(drive_proof_verifier::Error::StaleProof(..))
                | Error::DapiClientError(_)
                | Error::CoreClientError(_)
                | Error::TimeoutReached(_, _)
        );

        if retry {
            Some(true)
        } else {
            None
        }
    }
}
packages/rs-sdk/Cargo.toml (1)

9-9: LGTM! Consider specifying features for chrono.

The addition of chrono as a regular dependency aligns well with the PR objectives, particularly for implementing time-based checks for stale proofs. The version specified (0.4.38) is relatively recent, which is good for security and feature support.

Consider specifying only the features you need for chrono to minimize the dependency footprint. For example:

-chrono = { version = "0.4.38" }
+chrono = { version = "0.4.38", default-features = false, features = ["clock"] }

This assumes you only need the clock feature. Adjust the features based on your specific requirements.

packages/rs-dapi-client/src/transport/grpc.rs (2)

Line range hint 120-141: Approved: Improved retry logic with more flexibility

The changes to the can_retry method are well-implemented and align with the PR objectives. The use of Option<bool> provides more flexibility in error handling, and the inverted logic makes it clearer which cases are retryable.

Consider adding a brief comment explaining the logic behind the retry variable, e.g.:

// Determine if retry is possible based on the status code
// We retry for all codes except those explicitly listed
let retry = !matches!(
    code,
    Ok | DataLoss | Cancelled | Unknown | DeadlineExceeded | ResourceExhausted | Aborted | Internal | Unavailable
);

This would enhance code readability and make the intention behind the logic more explicit.


Line range hint 1-541: Suggestion: Improve file organization and documentation

While the changes in this file are focused on the can_retry method, there are some general improvements that could enhance the overall quality of the file:

  1. Consider grouping the impl_transport_request_grpc! macro invocations by client type (Platform and Core) using comments or regions for better organization.

  2. Add a brief documentation comment at the beginning of the file explaining its purpose and the main components it contains.

  3. Consider adding a TODO comment to explore if any of the repeated impl_transport_request_grpc! macro invocations can be further abstracted or generated programmatically to reduce code duplication.

These suggestions are not directly related to the current changes but could improve the overall maintainability of the file in the future.

packages/rs-dapi-client/src/dapi_client.rs (1)

Line range hint 236-246: Handle None cases explicitly when checking error.can_retry()

Using error.can_retry().unwrap_or(false) treats None as false, which may not reflect the intended behavior for indeterminate retry cases. To prevent unintended consequences, handle the None case explicitly.

Consider applying this change to handle None explicitly:

- if !error.can_retry().unwrap_or(false) {
+ match error.can_retry() {
+     Some(true) => {
+         // Error is retryable; no action needed here
+     },
+     Some(false) => {
+         // Error is not retryable; ban the address if necessary
+         if applied_settings.ban_failed_address {
+             let mut address_list = self
+                 .address_list
+                 .write()
+                 .expect("can't get address list for write");
+
+             address_list.ban_address(&address)
+                 .map_err(DapiClientError::<<R::Client as TransportClient>::Error>::AddressList)?;
+         }
+     },
+     None => {
+         // Indeterminate retryability; decide how to handle this case
+         // Possibly log or take default action
+     },
+ }
packages/rs-sdk/src/sdk.rs (3)

641-641: Incomplete 'FIXME' Comment

The comment at line 641 seems incomplete: // FIXME: in future, we need to implement t. Please complete or clarify the intended message.


751-751: Typo in Documentation Comment

In the documentation comment at line 751, "fefault" should be "default".


919-919: Typo in Documentation Comment

In the documentation comment at line 919, "fefault" should be "default".

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 00df5a6 and 34fee90.

📒 Files selected for processing (7)
  • packages/rs-dapi-client/src/dapi_client.rs (3 hunks)
  • packages/rs-dapi-client/src/lib.rs (1 hunks)
  • packages/rs-dapi-client/src/transport/grpc.rs (2 hunks)
  • packages/rs-drive-proof-verifier/src/error.rs (1 hunks)
  • packages/rs-sdk/Cargo.toml (1 hunks)
  • packages/rs-sdk/src/error.rs (2 hunks)
  • packages/rs-sdk/src/sdk.rs (13 hunks)
🧰 Additional context used
🔇 Additional comments (4)
packages/rs-sdk/src/error.rs (2)

8-8: LGTM: Import statement updated correctly.

The addition of CanRetry to the import statement is consistent with the new trait implementation and aligns with the PR objectives of improving error handling and retry logic.


Line range hint 1-100: Overall assessment: Changes align well with PR objectives.

The modifications to this file effectively implement the desired improvements in error handling and retry logic. The new CanRetry trait implementation provides a clear mechanism for identifying retryable errors, which should enhance the SDK's ability to handle transient issues.

packages/rs-drive-proof-verifier/src/error.rs (1)

85-89: Verify that all error handling covers the new StaleProof variant

The addition of the StaleProof variant to the Error enum may require updates in parts of the codebase where Error is matched or handled. Ensure that all pattern matches and error handling logic have been updated accordingly to incorporate this new variant.

Run the following script to locate all match statements involving Error:

This will help identify any areas that need to be updated to handle StaleProof appropriately.

packages/rs-dapi-client/src/dapi_client.rs (1)

42-49: Clarify None handling in can_retry method

The can_retry method returns Option<bool>, where None indicates indeterminate retry capability. Currently, the Mock variant returns None. Please confirm if returning None for Mock errors is intentional, or consider returning Some(true) or Some(false) to specify the retry behavior explicitly.

packages/rs-drive-proof-verifier/src/error.rs Outdated Show resolved Hide resolved
packages/rs-drive-proof-verifier/src/error.rs Outdated Show resolved Hide resolved
packages/rs-dapi-client/src/dapi_client.rs Outdated Show resolved Hide resolved
packages/rs-dapi-client/src/dapi_client.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
packages/rs-drive-proof-verifier/src/error.rs (1)

95-96: Minor improvement for consistency in error messages

The error messages in StaleProofError are informative and well-structured. However, there's a small inconsistency in terminology between the two variants:

  • StaleProofHeight uses "received" in both the field name and error message.
  • Time (soon to be StaleProofTime) uses "received" in the field name but "actual" in the error message.

For better consistency, consider using "received" in both error messages.

Suggested change for the Time variant's error message:

#[error(
    "stale proof time: expected {expected_timestamp_ms}ms, received {received_timestamp_ms} ms, tolerance {tolerance_ms} ms; try another server"
)]

This small change will make the error messages more consistent across both variants.

Also applies to: 105-107

packages/rs-sdk/src/sdk.rs (4)

104-121: LGTM! New fields for stale proof detection.

The addition of previous_proof_height, proof_height_tolerance, and proof_time_tolerance_ms fields to the Sdk struct implements the stale proof detection mechanism as described in the PR objectives. This is a good approach to ensure the freshness of received proofs.

Consider adding documentation comments for these new fields to explain their purpose and default values, especially for proof_time_tolerance_ms which is set to None by default.


577-615: Approve time verification logic with a suggestion.

The verify_proof_time function correctly implements the time-based stale proof detection. However, there's a potential issue with using the local system time.

Consider using a more reliable time source or implementing a time synchronization mechanism. The current implementation assumes that the local system time is accurate, which may not always be the case. This could lead to false positives or negatives in stale proof detection.

One possible approach is to use a Network Time Protocol (NTP) client to periodically synchronize the local time with a reliable time server. This would help ensure more accurate time comparisons.


742-754: LGTM! Configuration options for stale proof detection.

The addition of proof_height_tolerance and proof_time_tolerance_ms fields to the SdkBuilder struct, along with their respective with_* methods, provides good flexibility for users to configure the stale proof detection mechanism.

In the with_proof_time_tolerance method documentation, there's a typo in the word "default". It's written as "fefault". Please correct this typo.

-    /// This is set to `None` by fefault.
+    /// This is set to `None` by default.

Also applies to: 903-930


1090-1216: LGTM! Comprehensive test coverage for stale proof detection.

The new test module provides good coverage for the verify_proof_height and verify_proof_time functions. The use of test_matrix allows for testing multiple scenarios efficiently.

Consider adding a few more edge cases to the test_verify_proof_time function:

  1. Test with u64::MAX for received and now to ensure no overflow issues.
  2. Test with very large tolerance values to ensure they're handled correctly.

These additional tests would help ensure the robustness of the time verification logic under extreme conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 34fee90 and b1bc08e.

📒 Files selected for processing (3)
  • packages/rs-dapi-client/src/dapi_client.rs (3 hunks)
  • packages/rs-drive-proof-verifier/src/error.rs (1 hunks)
  • packages/rs-sdk/src/sdk.rs (13 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/rs-dapi-client/src/dapi_client.rs
🧰 Additional context used
🔇 Additional comments (4)
packages/rs-drive-proof-verifier/src/error.rs (2)

85-89: LGTM: StaleProof variant added correctly

The addition of the StaleProof variant to the Error enum is well-implemented. The use of #[error(transparent)] and #[from] attributes ensures proper error propagation and conversion from StaleProofError. This change aligns with the PR objective of detecting stale proofs.


85-115: Summary: Solid implementation of stale proof detection

The changes introduced in this file effectively implement the stale proof detection mechanism as outlined in the PR objectives. The new StaleProof variant in the Error enum and the StaleProofError enum provide a robust structure for handling and reporting stale proof errors.

The implementation is well-thought-out, with detailed error messages that include all relevant information. The suggested improvements are minor and focus on enhancing consistency and clarity in naming and error messages.

Overall, this is a strong addition to the error handling system that will improve the ability to detect and respond to stale proofs in the system.

packages/rs-sdk/src/sdk.rs (2)

617-679: LGTM! Robust height verification logic.

The verify_proof_height function implements a solid approach to height-based stale proof detection. It handles various edge cases well, including:

  1. Same height proofs
  2. Initial SDK startup (when prev is less than tolerance)
  3. Concurrent updates from multiple threads

The use of atomic operations ensures thread-safety, which is crucial for this shared state.


Line range hint 1-1216: Overall LGTM! Solid implementation of stale proof detection.

The changes in this file successfully implement the stale proof detection mechanism as described in the PR objectives. Key points:

  1. The Sdk struct now includes fields for tracking and configuring proof staleness.
  2. New helper functions verify_proof_time and verify_proof_height implement the core logic for detecting stale proofs.
  3. The SdkBuilder provides configuration options for users to customize the stale proof detection behavior.
  4. Comprehensive test coverage ensures the reliability of the new functionality.

These changes will help mitigate the issue of receiving outdated information when fetching data from nodes that have experienced a chain halt.

To ensure that the new stale proof detection mechanism is properly integrated, let's verify its usage in other parts of the codebase:

This will help confirm that the new functionality is being used correctly throughout the project.

✅ Verification successful

Line range hint 1-1216: Overall LGTM! Solid implementation of stale proof detection.

The changes in this file successfully implement the stale proof detection mechanism as described in the PR objectives. Key points:

  1. The Sdk struct now includes fields for tracking and configuring proof staleness.
  2. New helper functions verify_proof_time and verify_proof_height implement the core logic for detecting stale proofs.
  3. The SdkBuilder provides configuration options for users to customize the stale proof detection behavior.
  4. Comprehensive test coverage ensures the reliability of the new functionality.

These changes will help mitigate the issue of receiving outdated information when fetching data from nodes that have experienced a chain halt.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of verify_proof_metadata
rg --type rust 'verify_proof_metadata'

# Search for usage of with_proof_height_tolerance and with_proof_time_tolerance
rg --type rust 'with_proof_height_tolerance|with_proof_time_tolerance'

Length of output: 751

packages/rs-drive-proof-verifier/src/error.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
packages/rs-sdk/src/sdk.rs (4)

243-248: LGTM: Enhanced proof verification in parse_proof_with_metadata_and_proof.

The addition of the verify_proof_metadata call in the parse_proof_with_metadata_and_proof method improves the robustness of proof verification. This change aligns well with the new fields added to the Sdk struct.

Consider adding a comment explaining the purpose of the verify_proof_metadata call for better code readability.


577-615: LGTM: Well-implemented verify_proof_time function.

The verify_proof_time function correctly checks if the proof time is within the specified tolerance of the current time. It includes appropriate logging and returns a detailed error when the time is out of tolerance.

Consider using a constant for the log level (e.g., warn!) instead of hardcoding it, to make it easier to adjust logging verbosity in the future.


617-677: LGTM: Well-implemented verify_proof_height function with proper concurrency handling.

The verify_proof_height function correctly checks if the proof height is within the specified tolerance of the previous proof height. It uses atomic operations to handle concurrent access to the previous proof height, ensuring thread safety. The function includes appropriate logging and returns a detailed error when the height is out of tolerance.

Consider adding a comment explaining the purpose of the while loop (lines 666-674) for better code readability, as the atomic compare-and-swap logic might not be immediately obvious to all readers.


1088-1214: LGTM: Comprehensive test coverage for new proof verification functionality.

The new test module provides excellent coverage for the verify_proof_height and verify_proof_time functions, as well as the behavior of cloned SDK instances with respect to proof height verification. The use of test matrices allows for thorough testing of various input combinations, covering both valid and invalid cases.

Consider adding a few more edge cases to the test_verify_proof_time test matrix, such as testing with very large time values or the maximum possible u64 value, to ensure robustness against potential overflow scenarios.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between b1bc08e and f4b4e8b.

📒 Files selected for processing (2)
  • packages/rs-drive-proof-verifier/src/error.rs (1 hunks)
  • packages/rs-sdk/src/sdk.rs (13 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/rs-drive-proof-verifier/src/error.rs
🧰 Additional context used
🔇 Additional comments (4)
packages/rs-sdk/src/sdk.rs (4)

104-121: LGTM: New fields for proof validation.

The addition of previous_proof_height, proof_height_tolerance, and proof_time_tolerance_ms fields to the Sdk struct enhances the SDK's ability to validate proof freshness. These fields are well-typed and appropriately scoped for their intended use.


250-267: LGTM: Well-implemented verify_proof_metadata method.

The verify_proof_metadata method effectively utilizes the new fields to perform comprehensive proof validation. It checks both proof height and time when the respective tolerances are set, with appropriate error handling.


Line range hint 740-928: LGTM: Well-implemented additions to SdkBuilder for configuring proof tolerances.

The additions to the SdkBuilder struct and its implementation, including the new fields proof_height_tolerance and proof_time_tolerance_ms, along with their corresponding configuration methods with_proof_height_tolerance and with_proof_time_tolerance, provide a clean and flexible way to set up proof validation tolerances during SDK initialization. These changes are consistent with the new fields added to the Sdk struct.

The comments for with_proof_time_tolerance method provide valuable information about the implications of enabling time-based proof checks. This is excellent documentation that will help users make informed decisions when configuring the SDK.


Line range hint 1-1214: Overall assessment: Excellent implementation of proof staleness detection.

The changes introduced in this file significantly enhance the SDK's ability to detect stale proofs received from nodes. The implementation aligns well with the PR objectives and includes:

  1. New fields in the Sdk struct for tracking and configuring proof validation.
  2. Enhanced proof verification in existing methods.
  3. New functions for verifying proof time and height.
  4. Updates to the SdkBuilder for flexible configuration of proof validation tolerances.
  5. Comprehensive test coverage for the new functionality.

The implementation is thread-safe, well-documented, and provides users with the flexibility to configure proof staleness detection according to their needs. These changes should effectively address the issue of receiving outdated information when fetching data from nodes that have experienced a chain halt.

packages/rs-drive-proof-verifier/src/error.rs Outdated Show resolved Hide resolved
packages/rs-sdk/src/sdk.rs Outdated Show resolved Hide resolved
packages/rs-sdk/src/sdk.rs Outdated Show resolved Hide resolved
packages/rs-sdk/src/sdk.rs Outdated Show resolved Hide resolved
packages/rs-dapi-client/src/lib.rs Show resolved Hide resolved
@lklimek lklimek requested a review from shumkov October 18, 2024 15:23
@lklimek
Copy link
Contributor Author

lklimek commented Oct 18, 2024

@coderabbitai full review

Copy link
Contributor

coderabbitai bot commented Oct 18, 2024

✅ Actions performed

Full review triggered.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
packages/rs-sdk/src/error.rs (2)

70-73: LGTM: New StaleNode variant added correctly.

The addition of the StaleNode variant to the Error enum aligns well with the PR objective of detecting stale proofs. The use of #[from] and #[error(transparent)] attributes is appropriate for error conversion and propagation.

Consider adding a brief doc comment to explain the purpose of this variant, e.g.:

/// Error indicating that the node provided stale data
#[error(transparent)]
StaleNode(#[from] StaleNodeError),

100-125: LGTM: StaleNodeError enum defined correctly with comprehensive error handling.

The StaleNodeError enum is well-defined and aligns perfectly with the PR objective of detecting stale proofs based on height and time. Both variants (Height and Time) include detailed error messages and relevant fields, which will be helpful for debugging and error reporting.

The use of the #[error] attribute for custom error formatting is appropriate, and the inclusion of doc comments for the enum and its fields is commendable.

Consider adding a brief example in the doc comment for the StaleNodeError enum to illustrate its usage, e.g.:

/// Server returned stale metadata
///
/// # Examples
///
/// ```
/// use your_crate::StaleNodeError;
///
/// let height_error = StaleNodeError::Height {
///     expected_height: 100,
///     received_height: 90,
///     tolerance_blocks: 5,
/// };
/// 
/// let time_error = StaleNodeError::Time {
///     expected_timestamp_ms: 1000,
///     received_timestamp_ms: 900,
///     tolerance_ms: 50,
/// };
/// ```
#[derive(Debug, thiserror::Error)]
pub enum StaleNodeError {
    // ... (rest of the enum definition)
}

This addition would enhance the documentation and provide a clear usage example for developers.

packages/rs-dapi-client/src/transport/grpc.rs (1)

Line range hint 120-132: Approved: Improved error handling and retry logic

The changes to the CanRetry trait implementation for dapi_grpc::tonic::Status are well-thought-out and align with the PR objectives. The new can_retry method name is more descriptive, and the inverted logic using the matches! macro improves readability and maintainability.

Consider adding a blank line after the use dapi_grpc::tonic::Code::*; statement for better code readability:

 fn can_retry(&self) -> bool {
     let code = self.code();

     use dapi_grpc::tonic::Code::*;
+

     !matches!(
         code,
         Ok | DataLoss
             | Cancelled
             | Unknown
             | DeadlineExceeded
             | ResourceExhausted
             | Aborted
             | Internal
             | Unavailable
     )
 }
packages/rs-dapi-client/src/dapi_client.rs (2)

Line range hint 236-242: Re-evaluate Address Banning on Non-Retryable Errors

In the error handling block, when !error.can_retry() is true (i.e., the error is non-retryable), the code bans the failed address:

if !error.can_retry() {
    if applied_settings.ban_failed_address {
        // Ban the address
    }
}

This might lead to banning addresses due to non-recoverable errors, which could be outside the node's control (e.g., client-side issues).

Consider modifying the logic to ban addresses only on retryable errors that persist, indicating a problem with the node:

-if !error.can_retry() {
+if error.can_retry() {

This change ensures that only addresses causing persistent, retryable errors are banned, improving the robustness of the client.


272-274: Log Retryable Errors Appropriately

Currently, only non-retryable errors are logged at the error level:

if !error.can_retry() {
    tracing::error!(?error, "request failed");
}

Consider also logging retryable errors, perhaps at the warn level, to provide visibility into issues that require retries:

if !error.can_retry() {
    tracing::error!(?error, "request failed");
+} else {
+    tracing::warn!(?error, "retryable error occurred");
}

This enhancement aids in monitoring and debugging by highlighting errors that cause retries.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f4b4e8b and 46aeeee.

📒 Files selected for processing (5)
  • packages/rs-dapi-client/src/dapi_client.rs (3 hunks)
  • packages/rs-dapi-client/src/lib.rs (1 hunks)
  • packages/rs-dapi-client/src/transport/grpc.rs (1 hunks)
  • packages/rs-sdk/src/error.rs (3 hunks)
  • packages/rs-sdk/src/sdk.rs (15 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/rs-dapi-client/src/lib.rs
  • packages/rs-sdk/src/sdk.rs
🧰 Additional context used
🔇 Additional comments (3)
packages/rs-sdk/src/error.rs (2)

8-8: LGTM: Import statement updated correctly.

The import statement has been properly updated to include the CanRetry trait, which is consistent with its implementation for the Error enum later in the file.


88-98: LGTM: CanRetry trait implemented correctly.

The implementation of the CanRetry trait for the Error enum is well-done. It correctly includes the new StaleNode error in the list of retryable errors, along with other relevant error types. The use of the matches! macro is appropriate and efficient for this kind of check.

This implementation aligns well with the PR objective of improving retry logic in the SDK.

packages/rs-dapi-client/src/dapi_client.rs (1)

42-47: Verify can_retry Logic for All Error Variants

The can_retry method returns true for NoAvailableAddresses and AddressList errors, and delegates to transport_error.can_retry() for Transport errors. For Mock errors, it returns false.

Please confirm that this logic aligns with the intended retry policy:

  • NoAvailableAddresses: Retrying may not be effective if there are truly no available addresses. Consider whether it should return false to prevent unnecessary retries.
  • AddressList Errors: Assess if these errors are recoverable through retries.

Ensuring the can_retry logic accurately reflects which errors are retryable will prevent futile retries and potential resource wastage.

packages/rs-dapi-client/src/dapi_client.rs Outdated Show resolved Hide resolved
packages/rs-sdk/src/error.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (9)
packages/rs-dapi-client/src/lib.rs (1)

74-85: Excellent improvements to the CanRetry trait!

The changes to the CanRetry trait are well-implemented and improve its functionality:

  1. The new can_retry method provides a clear and simple interface.
  2. The deprecation of is_node_failure is handled gracefully, maintaining backward compatibility while encouraging the use of the new method.
  3. The trait documentation has been updated to reflect these changes.

These modifications enhance the trait's clarity and align well with Rust best practices.

Consider adding a brief explanation in the trait-level documentation about when an operation might not be retryable. This could provide valuable context for users of the trait.

 /// Returns true if the operation can be retried.
+/// 
+/// Some operations may not be safely retried due to potential side effects or other constraints.
+/// Implementors of this trait should carefully consider the conditions under which an operation
+/// can be safely retried.
 pub trait CanRetry {
     /// Returns true if the operation can be retried safely.
     fn can_retry(&self) -> bool;
packages/rs-sdk/Cargo.toml (1)

9-9: LGTM! Consider adding a comment for clarity.

The addition of chrono as a regular dependency is appropriate given the new time-based proof validation feature mentioned in the PR description. The version specified (0.4.38) is recent and exact, which is good for reproducibility.

Consider adding a brief comment explaining why chrono is needed, e.g.:

# Required for time-based proof validation
chrono = { version = "0.4.38" }

This will help future maintainers understand the purpose of this dependency.

packages/rs-sdk/src/error.rs (3)

70-73: LGTM: New StaleNode variant added correctly.

The StaleNode variant has been appropriately added to the Error enum, using the #[from] attribute for automatic conversion from StaleNodeError. This addition aligns well with the PR objective of detecting stale proofs from nodes.

Consider adding a brief doc comment to explain the purpose of this new error variant, for example:

/// Error indicating that the node provided stale data
#[error(transparent)]
StaleNode(#[from] StaleNodeError),

88-98: LGTM: CanRetry trait implemented correctly.

The CanRetry trait has been properly implemented for the Error enum. The can_retry method correctly identifies which error types are eligible for retry attempts.

To improve maintainability, consider defining a constant array of retryable error variants:

impl CanRetry for Error {
    const RETRYABLE_ERRORS: [Error; 4] = [
        Error::StaleNode(..),
        Error::DapiClientError(_),
        Error::CoreClientError(_),
        Error::TimeoutReached(_, _),
    ];

    fn can_retry(&self) -> bool {
        Self::RETRYABLE_ERRORS.iter().any(|e| std::mem::discriminant(e) == std::mem::discriminant(self))
    }
}

This approach centralizes the list of retryable errors, making it easier to maintain and update in the future.


100-125: LGTM: StaleNodeError enum added with comprehensive error information.

The StaleNodeError enum has been well-designed to provide detailed information about stale node errors. Both Height and Time variants include all necessary fields to understand the nature of the staleness.

For consistency with Rust naming conventions, consider changing the field names in the Time variant to snake_case:

Time {
    /// Expected time in milliseconds - is local time when the message was received
    expected_timestamp_ms: u64,
    /// Time received from the server in the message, in milliseconds
    received_timestamp_ms: u64,
    /// Tolerance in milliseconds
    tolerance_ms: u64,
}

This change would make the field names consistent with those in the Height variant.

packages/rs-dapi-client/src/transport/grpc.rs (1)

Line range hint 120-133: Approve changes with a minor suggestion for improvement

The refactoring of the CanRetry implementation for dapi_grpc::tonic::Status is well done. The new method name can_retry is more descriptive and aligns better with the trait name. The inverted logic using the matches! macro improves readability and maintainability.

However, to further enhance code organization and readability, consider extracting the list of status codes that prevent retrying into a constant or a separate function. This would make the main can_retry function more concise and easier to understand at a glance.

Here's a suggested refactoring:

impl CanRetry for dapi_grpc::tonic::Status {
    fn can_retry(&self) -> bool {
        !is_non_retryable_status(self.code())
    }
}

fn is_non_retryable_status(code: tonic::Code) -> bool {
    use dapi_grpc::tonic::Code::*;
    matches!(
        code,
        Ok | DataLoss | Cancelled | Unknown | DeadlineExceeded | ResourceExhausted | Aborted | Internal | Unavailable
    )
}

This refactoring separates the concerns of determining which status codes are non-retryable from the main can_retry logic, making the code more modular and easier to maintain.

packages/rs-sdk/src/sdk.rs (3)

762-762: Typo in Error Message

There's a typo in the error message: "data conttact cache size must be positive". It should be "data contract cache size must be positive".

Apply this diff to fix the typo:

 .expect("data conttact cache size must be positive"),
+.expect("data contract cache size must be positive"),

888-922: Clarify Documentation for Tolerance Methods

The documentation for with_height_tolerance and with_time_tolerance methods is comprehensive. However, consider adding examples to demonstrate how to use these methods effectively and explain the implications of different tolerance settings on staleness detection.


1081-1218: Test Coverage for Edge Cases

The added tests in the test module are valuable for verifying the staleness logic. To enhance test coverage, consider adding more edge case scenarios, such as:

  • Testing the behavior when metadata_height_tolerance or metadata_time_tolerance_ms is None.
  • Simulating scenarios where the expected_height is exactly equal to tolerance.
  • Verifying the behavior when system time moves backward.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 00df5a6 and 46aeeee.

📒 Files selected for processing (6)
  • packages/rs-dapi-client/src/dapi_client.rs (3 hunks)
  • packages/rs-dapi-client/src/lib.rs (1 hunks)
  • packages/rs-dapi-client/src/transport/grpc.rs (1 hunks)
  • packages/rs-sdk/Cargo.toml (1 hunks)
  • packages/rs-sdk/src/error.rs (3 hunks)
  • packages/rs-sdk/src/sdk.rs (15 hunks)
🧰 Additional context used
🔇 Additional comments (6)
packages/rs-sdk/src/error.rs (2)

8-8: LGTM: Import statement updated correctly.

The import statement has been appropriately updated to include CanRetry along with DapiClientError. This change aligns with the new CanRetry trait implementation for the Error enum.


Line range hint 1-125: Summary: Excellent implementation of stale node detection and error handling.

The changes in this file effectively implement the stale node detection mechanism as described in the PR objectives. The new StaleNodeError enum, along with the StaleNode variant in the Error enum, provides a robust way to handle and report stale node issues. The implementation of the CanRetry trait enhances the SDK's ability to handle retryable errors gracefully.

These modifications will significantly improve the SDK's resilience when dealing with nodes that may have experienced a chain halt or are otherwise out of sync. The clear error messages and detailed error information will aid in debugging and maintaining the system.

Overall, the changes are well-implemented, properly documented, and align closely with the PR objectives. Great job!

packages/rs-dapi-client/src/dapi_client.rs (2)

42-47: Implementation of can_retry() is appropriate

The can_retry() method for DapiClientError correctly determines the retryability of different error types, which is essential for the retry logic.


272-274: Logging non-retryable errors enhances observability

Appropriately logging errors that are not retryable helps in monitoring and diagnosing issues that cannot be automatically recovered.

packages/rs-sdk/src/sdk.rs (2)

134-136: Ensure Proper Cloning of New Fields in Clone Implementation

The Clone implementation now includes cloning of the new fields metadata_last_seen_height, metadata_height_tolerance, and metadata_time_tolerance_ms. Verify that these fields are correctly cloned to prevent unintended sharing or concurrency issues, especially with the atomic variable metadata_last_seen_height.


965-969: Initialize metadata_last_seen_height Appropriately

When initializing metadata_last_seen_height with AtomicU64::new(0), consider whether starting at zero is appropriate. If zero is not a valid or meaningful height, this could affect staleness checks. Ensure that the initial value aligns with the expected behavior when the SDK starts operating.

packages/rs-sdk/src/sdk.rs Show resolved Hide resolved
packages/rs-sdk/src/sdk.rs Show resolved Hide resolved
@lklimek lklimek requested a review from shumkov October 18, 2024 15:49
@shumkov shumkov changed the title feat(sdk): detect stale proofs received from nodes feat(sdk): detect stale nodes Oct 18, 2024
Copy link
Member

@shumkov shumkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
packages/rs-sdk/src/error.rs (3)

70-73: LGTM: New StaleNode variant added correctly.

The addition of the StaleNode variant to the Error enum is appropriate and aligns with the PR objective of detecting stale nodes. The use of #[from] attribute for automatic conversion from StaleNodeError is a good practice.

Consider updating the error message to be more specific:

-    #[error(transparent)]
+    #[error("Stale node detected: {0}")]
     StaleNode(#[from] StaleNodeError),

This change would provide more context when the error is displayed.


94-119: LGTM: StaleNodeError enum is well-designed and informative.

The new StaleNodeError enum is a great addition, providing specific information about stale node errors. The separate variants for Height and Time allow for precise error reporting, and the error messages are descriptive and include all necessary information.

Consider moving the StaleNodeError enum definition above the Error enum in the file. This would improve code organization by defining the more specific error type before it's used in the main Error enum.

// Move this block above the Error enum definition
/// Server returned stale metadata
#[derive(Debug, thiserror::Error)]
pub enum StaleNodeError {
    // ... (rest of the enum definition)
}

// Then keep the Error enum as is
/// Error type for the SDK
#[derive(Debug, thiserror::Error)]
pub enum Error {
    // ... (rest of the enum definition)
}

This change would enhance code readability and follow the convention of defining more specific types before their usage.


Line range hint 1-119: Overall, the changes look great and achieve the PR objectives.

The modifications to error.rs successfully introduce new error types and handling mechanisms for stale nodes, aligning well with the PR objectives. The code is well-structured, follows Rust best practices, and improves error handling by providing more detailed information about stale node errors.

To further enhance the error handling capabilities, consider adding a logging mechanism for these errors. This could help with debugging and monitoring in production environments. For example:

use log::error;

impl Error {
    pub fn log(&self) {
        match self {
            Error::StaleNode(stale_error) => {
                error!("Stale node detected: {}", stale_error);
            }
            // Add other error types as needed
            _ => {
                error!("An error occurred: {}", self);
            }
        }
    }
}

This addition would allow for consistent logging of errors throughout the SDK, which could be valuable for troubleshooting and monitoring.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 46aeeee and 9795df7.

📒 Files selected for processing (2)
  • packages/rs-dapi-client/src/dapi_client.rs (3 hunks)
  • packages/rs-sdk/src/error.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/rs-dapi-client/src/dapi_client.rs
🧰 Additional context used
🔇 Additional comments (2)
packages/rs-sdk/src/error.rs (2)

8-8: LGTM: Import statement updated correctly.

The import statement has been properly updated to include the CanRetry trait, which is necessary for the new functionality being added to the Error enum.


88-92: Implementation looks good, but let's revisit the retry logic.

The implementation of the CanRetry trait for the Error enum is well-structured and allows for more nuanced error handling. However, there's a potential concern regarding which errors should be retryable.

Based on a previous comment by shumkov, there was a suggestion to limit retries to Error::StaleNode and Error::TimeoutReached. The current implementation follows this suggestion. However, we should verify if this is still the desired behavior or if we need to adjust the retry logic for other error types.

Could you confirm if the current retry logic aligns with the team's latest decision on error handling?

@shumkov shumkov merged commit b0fa337 into v1.4-dev Oct 19, 2024
30 checks passed
@shumkov shumkov deleted the feat/sdk-verify-mtd-height branch October 19, 2024 05:31
@shumkov shumkov changed the title feat(sdk): detect stale nodes feat(sdk)!: detect stale nodes Oct 19, 2024
@shumkov shumkov assigned shumkov and unassigned shumkov Oct 19, 2024
@shumkov shumkov added this to the 1.4.2 milestone Oct 19, 2024
@coderabbitai coderabbitai bot mentioned this pull request Oct 22, 2024
6 tasks
@coderabbitai coderabbitai bot mentioned this pull request Nov 13, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants