error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

muzarski · 2024-10-04T11:57:31Z

Since last time, during error refactor I introduced a silent bug to the code (#1075), I'd like to prevent that from happening in the future. This is why I replace _ matches with explicit error variants.
This way, if some developer introduces a new variant to QueryError, the compiler will complain about it in multiple places, forcing the developer to adjust match expressions there.

Notice that LatencyAwareness::reliable_latency_measure matched errors this way even before this PR.

Discussion:

I think it's a great place to discuss what decision should be made based on the received error in following modules/places:

LatencyAwareness::reliable_latency_measure
receiving USE KEYSPACE result from a single connection/node (see use_keyspace_result from cluster.rs)
DefaultRetrySession::decide_should_retry (same for DowngradingConsistencyRetrySession)
deciding whether the error received after speculative execution should be ignored

I did not change the logic at all yet, simply transformed the code to use explicit match expressions.
My first suspect is QueryError::TimeoutError in can_be_ignored in speculative_execution module. I believe we should
return true for this error only iff we return true for QueryError::RequestTimeout.

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
~~[ ] I added relevant tests for new features and bug fixes.~~
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
~~[ ] I have provided docstrings for the public items that I want to introduce.~~
~~[ ] I have adjusted the documentation in ./docs/source/.~~
I added appropriate Fixes: annotations to PR description.

github-actions · 2024-10-04T12:02:31Z

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: c400b5f

scylla/src/transport/downgrading_consistency_retry_policy.rs

wprzytula · 2024-10-07T11:11:27Z

scylla/src/transport/downgrading_consistency_retry_policy.rs

            // In all other cases propagate the error to the user
-            _ => RetryDecision::DontRetry,
+            QueryError::BadQuery(_)
+            | QueryError::MetadataError(_)
+            | QueryError::CqlRequestSerialization(_)
+            | QueryError::BodyExtensionsParseError(_)
+            | QueryError::EmptyPlan
+            | QueryError::CqlResultParseError(_)
+            | QueryError::CqlErrorParseError(_)
+            | QueryError::ProtocolError(_)
+            | QueryError::TimeoutError


Should we really propagate TimeoutError instead of trying another node? Well, it's the weird error about use_keyspace...

Funny thing is that this error variant can't even appear there. AFAIU, use keyspace requests go through different path than regular requests. They are not a subject to retries, since we send use keyspace requests to all nodes simultaneously.

In general, there are some errors that can't possibly appear in some places (same for speculative_execution::can_be_ignored). This is because QueryError is a really broad error type, whose specific variants are constructed in multiple driver layers. IMO, the only way to fix this, is to narrow the internal return error types as much as possible. I'm not sure how much work it would require, though.

scylla/src/transport/retry_policy.rs

scylla/src/transport/errors.rs

Lorak-mmk · 2024-10-08T11:28:18Z

ref: #519

Since last time, during error refactor I introduced a silent bug to the code (#1075), I'd like to prevent that from happening in the future. This is why I replace _ matches with explicit error variants. This way, if some developer introduces a new variant to QueryError, the compiler will complain about it in multiple places, forcing the developer to adjust match expressions there.

Idea: places where explicit match should be used could use a comment about it, and perhaps this lint: https://rust-lang.github.io/rust-clippy/master/index.html#/wildcard_enum_match_arm
That way a future contributor won't simplify the code by using wildcard.

Notice that LatencyAwareness::reliable_latency_measure matched errors this way even before this PR.

Discussion:

I think it's a great place to discuss what decision should be made based on the received error in following modules/places:

The most important thing is to estabilish a rule for each of the following places, so that contributors know how to categorize possible future variants - and perhaps fix bugs in categorization that we do.
General question: are we sure that if connection is broken then we always receive ConnectionBroken variant? It's not possible to receive some deserialization error because of stream ending too early, right?
I will assume that we always receive ConnectionBroken in that case - and also in the case that we can't write to socket because it was closed.

* `LatencyAwareness::reliable_latency_measure`

The rule in the comments is imo reasonable, but a bit imprecise.
Some errors are not always fast or slow - and the comments don't specify where they should go.
I suspect that "slow" errors should be only those that are always "slow" - otherwise they can't be a reliable latency measure.

Another issue: what does it mean for the query to be "slow" here?
I don't get why "DbError::IsBootstrapping", "DbError::Unprepared" or "DbError::Unavailable" are fast, but "DbError::SyntaxError" or "DbError::Invalid" (and some others) are "slow".

Note: "DbError" in the slow branch should also not use wildcard.

* receiving `USE KEYSPACE` result from a single connection/node (see `use_keyspace_result` from cluster.rs)

The rule in the comment of that function looks reasonable to me, and the current filtering matches this rule.
The invariant is that after the function returns all connections are either broken or will have new keyspace set.
If this invariant was kept in the upper layers then the behavior would be good: after Session.use_keyspace returns all new queries will be issued on connections with the new keyspace.
However it looks like this invariant is not kept in the upper layers - PoolRefiller.use_keyspace doesn't close connections that returned an error.

* `DefaultRetrySession::decide_should_retry` (same for `DowngradingConsistencyRetrySession`)

Those are complicated - I don't think there can exist a general rule. I'll look at them later.

* deciding whether the error received after speculative execution should be ignored

The rule here seems to be the we can ignore errors if the presence of the error doesn't mean other speculative tries will fail too.
As I mention below I think it is not possible to receive QueryError::TimeoutError or QueryError::RequestTimeout here - so we should not ignore them so they are propagated to the user if the assumption turns out to be wrong.

QueryError::DbError:
- SyntaxError - can't ignore
- Invalid - I'm not sure. When can it happen?
- AlreadyExists - can't ignore
- FunctionFailure - not sure if UDF failure is transient. @wmitros ?
- AuthenticationError - Can this happen? I think not. We can't ignore it anyway
- Unauthorized - Can't ignore
- ConfigError - not sure. When can this happen?
- Unavailable - may be a split brain situation etc. I think we can ignore.
- Overloaded - We can ignore
- IsBootstrapping - can ignore
- TruncateError - not sure. When can this happen?
- ReadTimeout - can ignore
- WriteTimeout - can ignore
- ReadFailure - I'm not sure when can this happen. Probably can ignore
- WriteFailure - as above
- Unprepared - hmm... this can happen if we tried to reprepare, but it is still unprepared. I think we can ignore - it may be preparable on another node
- ServerError - I think we can ignore, but I'm not sure
- ProtocolError - why is there protocol error here too? The one in QueryError is not enough? Anyway, we probably can't ignore
- RateLimitReached - Can ignore
- Other - this shouldn't happen, right? I think we should not ignore
QueryError::BadQuery - we can't ignore because query is bad always or never
QueryError::CqlRequestSerialization - same as above
QueryError::BodyExtensionsParseError - if we get parse / deserialization error then something went terribly wrong, no need to retry on another node
QueryError::CqlResultParseError - same as above
QueryError::CqlErrorParseError - same as above
QueryError::ConnectionPoolError - we can ignore, we just don't have connections to given node
QueryError::ProtocolError - we can't ignore, same as with serialization / deserialization error
QueryError::InvalidMessage - I don't know when it happens. Assuming it's similar to ProtocolError we can't ignore it.
QueryError::TimeoutError - shouldn't happen, so we can't ignore
QueryError::BrokenConnection - we can ignore
QueryError::UnableToAllocStreamId - we can ignore
QueryError::RequestTimeout - shouldn't happen, so we can't ignore

I did not change the logic at all yet, simply transformed the code to use explicit match expressions. My first suspect is QueryError::TimeoutError in can_be_ignored in speculative_execution module. I believe we should return true for this error only iff we return true for QueryError::RequestTimeout.

I'm fairly sure that RequestTimeout is applied to the whole execution, not individual speculative fibers, so it is not possible to receive it there.

muzarski · 2024-10-08T12:03:11Z

It's not possible to receive some deserialization error because of stream ending too early, right?

That's right. We always read the whole frame, and only then we start to deserialize the data. The only exception is frame header deserialization (see FrameHeaderParseError), but if such error appears, we close the connection anyway (i.e. it's included in BrokenConnectionError).

The rule in the comments is imo reasonable, but a bit imprecise. Some errors are not always fast or slow - and the comments don't specify where they should go. I suspect that "slow" errors should be only those that are always "slow" - otherwise they can't be a reliable latency measure.

Another issue: what does it mean for the query to be "slow" here? I don't get why "DbError::IsBootstrapping", "DbError::Unprepared" or "DbError::Unavailable" are fast, but "DbError::SyntaxError" or "DbError::Invalid" (and some others) are "slow".

When I first looked at this code, my first was thought that "fast" errors are the errors that appear on driver's side, before sending the request to the server. I'm not sure why the DbErrors you mentioned are "fast" as well.

Invalid - I'm not sure. When can it happen?

An example of this error that appears in cpp-rust-driver integration tests:

Invalid query: Database returned an error: 
The query is syntactically correct but invalid, 
Error message: No keyspace has been specified. 
USE a keyspace, or explicitly specify 
keyspace.tablename

I believe this should not be ignored - it will fail on other nodes as well.

ProtocolError - why is there protocol error here too? The one in QueryError is not enough? Anyway, we probably can't ignore

This is a protocol error that was recognized by the server. The QueryError::ProtocolError are the protocol errors recognized by the driver. Anyway, I agree that we should not ignore it.

QueryError::InvalidMessage - I don't know when it happens. Assuming it's similar to ProtocolError we can't ignore it.

I removed this variant during protocol error refactor.

Lorak-mmk · 2024-10-08T12:07:51Z

When I first looked at this code, my first was thought that "fast" errors are the errors that appear on driver's side, before sending the request to the server. I'm not sure why the DbErrors you mentioned are "fast" as well.

My first thought was that "fast" errors are errors on the driver side and error on server side that happen before it tries to contact other nodes (so can reply quickly, without actually performing the query).
I'm not sure this is correct.

Invalid - I'm not sure. When can it happen?

An example of this error that appears in cpp-rust-driver integration tests:
Invalid query: Database returned an error: 
The query is syntactically correct but invalid, 
Error message: No keyspace has been specified. 
USE a keyspace, or explicitly specify 
keyspace.tablename
I believe this should not be ignored - it will fail on other nodes as well.

I agree

muzarski · 2024-10-08T14:35:14Z

v2:

rebased on main (I apologize for that - should be done in a separate push, but I forgot that I already pulled some changes locally)
enabled wildcard_enum_match_arm lint in some places (including latency awareness)
using explicit matches for DbErrors (including latency awareness)
fixed can_be_ignored function in speculative_execution module

Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants in `reliable_latency_measure` function. We also enable the `wildcard_enum_match_arm` clippy lint here.

Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants in retry policy modules. We also enable `wildcard_enum_match_arm` clippy lint in this place for QueryError and DbError matches.

Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants when deciding if error received after `USE KEYSPACE` should be ignored. We also enable the `wildcard_enum_match_arm` clippy lint to disallow using `_` matches.

Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants when deciding if error received from speculative execution should be ignored. We also enabled the `wildcard_enum_match_arm` clippy lint.

Previously, `can_be_ignored` function would return `true` for some weird error variants. I adjusted the implementation, and justified the decision for each error variant in the comments.

muzarski self-assigned this Oct 4, 2024

muzarski requested review from wprzytula and Lorak-mmk October 4, 2024 12:00

muzarski changed the title ~~Internal error matches no wildcard~~ error: use explicit matches for errors when the decision is crucial for correctness/performance Oct 4, 2024

wprzytula reviewed Oct 7, 2024

View reviewed changes

muzarski force-pushed the internal-error-matches-no-wildcard branch from e09334f to 5bd35ff Compare October 8, 2024 14:30

muzarski requested a review from wprzytula October 8, 2024 15:18

muzarski added 5 commits October 14, 2024 15:51

speculative_execution: fix can_be_ignored implementation

c400b5f

Previously, `can_be_ignored` function would return `true` for some weird error variants. I adjusted the implementation, and justified the decision for each error variant in the comments.

muzarski force-pushed the internal-error-matches-no-wildcard branch from 5bd35ff to c400b5f Compare October 14, 2024 13:51

muzarski added this to the 0.15.0 milestone Oct 28, 2024

wprzytula approved these changes Nov 13, 2024

View reviewed changes

muzarski mentioned this pull request Nov 13, 2024

spec_exec: fix can_be_ignored function #1124

Merged

8 tasks

muzarski removed this from the 0.15.0 milestone Nov 13, 2024

wprzytula mentioned this pull request Nov 25, 2024

bump rust-driver to 0.15 (with eager deserialization) scylladb/cpp-rust-driver#206

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

muzarski commented Oct 4, 2024 •

edited

Loading

github-actions bot commented Oct 4, 2024 •

edited

Loading

wprzytula Oct 7, 2024

muzarski Oct 8, 2024

muzarski Oct 8, 2024 •

edited

Loading

Lorak-mmk commented Oct 8, 2024

Discussion:

muzarski commented Oct 8, 2024

Lorak-mmk commented Oct 8, 2024

muzarski commented Oct 8, 2024

error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

Are you sure you want to change the base?

error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

Conversation

muzarski commented Oct 4, 2024 • edited Loading

Discussion:

github-actions bot commented Oct 4, 2024 • edited Loading

wprzytula Oct 7, 2024

Choose a reason for hiding this comment

muzarski Oct 8, 2024

Choose a reason for hiding this comment

muzarski Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Lorak-mmk commented Oct 8, 2024

Discussion:

muzarski commented Oct 8, 2024

Lorak-mmk commented Oct 8, 2024

muzarski commented Oct 8, 2024

muzarski commented Oct 4, 2024 •

edited

Loading

github-actions bot commented Oct 4, 2024 •

edited

Loading

muzarski Oct 8, 2024 •

edited

Loading