
CI: "group 0 concurrent modification" #1126

Closed
Lorak-mmk opened this issue Nov 20, 2024 · 5 comments · Fixed by #1127

@Lorak-mmk
Collaborator

New versions of Scylla use Raft for schema / topology operations.
Raft serializes those operations internally. Concurrent DDL operations are handled similarly to optimistic locking:
they basically work like "if the Raft state is X, change the schema accordingly (to state Y)".
If two such operations execute at the same time, one of them will fail (because the state is no longer X).
The Scylla server will restart such operations in that case, but the number of restarts is limited to 10.
If there are many concurrent DDL queries, it's possible to hit this limit, and it happens very often in our tests,
which makes CI unstable.

There are several possible solutions:

  • Limit test concurrency to 1. This makes the tests multiple times slower, so we should avoid it.
  • Have a global mutex for DDL changes.
  • Always send DDL changes to the same node. This works because Scylla handles this case internally.

The third solution should be the best performance-wise, because it doesn't introduce any additional locking and just lets Scylla handle this case.

The way to do this is to introduce a helper function in test utils that sends DDL queries.
This function would either:

  • take a mutex (second solution, sketched below), or
  • set a load balancing policy that sends the query to the correct node (third solution).
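
As a rough illustration, the mutex variant could look something like the sketch below. The helper name and the exact session call are placeholders, not the final test-utils API, and method names differ between driver versions.

use scylla::transport::errors::QueryError;
use scylla::Session;
use tokio::sync::Mutex;

// Global lock serializing all DDL statements issued by the test suite.
static DDL_LOCK: Mutex<()> = Mutex::const_new(());

// Hypothetical helper: tests would send schema changes through this function
// instead of calling the session directly.
async fn run_ddl(session: &Session, ddl: &str) -> Result<(), QueryError> {
    let _guard = DDL_LOCK.lock().await;
    // The session method name depends on the driver version
    // (`query` in older releases, `query_unpaged` in newer ones).
    session.query_unpaged(ddl, ()).await?;
    Ok(())
}
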
@Lorak-mmk Lorak-mmk added area/testing Related to tests - unit, integration, or even out of repo bug Something isn't working symptom/ci stability Related to continuous integration labels Nov 20, 2024
@muzarski
Contributor

The third option could also use #1031 once it's implemented.

@Lorak-mmk
Collaborator Author

The third option could also use #1031 once it's implemented.

I'm currently working on this issue. The way I see the LBP is this:

// This LBP produces a predictable query plan - it orders the nodes
// by their position in the ring.
// This is to make sure that all DDL queries land on the same node,
// to prevent errors from concurrent DDL queries executed on different nodes.
#[derive(Debug)]
struct SchemaQueriesLBP {}

impl LoadBalancingPolicy for SchemaQueriesLBP {
    fn pick<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> Option<(
        scylla::transport::NodeRef<'a>,
        Option<scylla::routing::Shard>,
    )> {
        // I'm not sure if Scylla can handle concurrent DDL queries to different shards,
        // in other words, whether its local lock is per-node or per-shard.
        // Just to be safe, let's use explicit shard.
        cluster.get_nodes_info().first().map(|node| (node, Some(0)))
    }

    fn fallback<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> scylla::load_balancing::FallbackPlan<'a> {
        Box::new(cluster.get_nodes_info().iter().map(|node| (node, Some(0))))
    }

    fn name(&self) -> String {
        todo!()
    }
}

WDYT @muzarski ?
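
For reference, a minimal sketch of how such a policy could be attached to DDL statements through a per-statement execution profile; the run_ddl helper is an assumption (not the actual test-utils code), and import paths and session method names may differ slightly between driver versions.

use std::sync::Arc;

use scylla::transport::errors::QueryError;
use scylla::{ExecutionProfile, Session};

// Hypothetical helper: wraps every DDL statement in an execution profile
// that uses SchemaQueriesLBP, so all schema changes are routed to one node.
async fn run_ddl(session: &Session, ddl: &str) -> Result<(), QueryError> {
    let handle = ExecutionProfile::builder()
        .load_balancing_policy(Arc::new(SchemaQueriesLBP {}))
        .build()
        .into_handle();

    let mut query = scylla::query::Query::new(ddl);
    query.set_execution_profile_handle(Some(handle));
    // The session method name depends on the driver version
    // (`query` in older releases, `query_unpaged` in newer ones).
    session.query_unpaged(query, ()).await?;
    Ok(())
}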

@Lorak-mmk Lorak-mmk self-assigned this Nov 20, 2024
@muzarski
Contributor

So from my understanding, if the query fails, the retry might be sent to the next node in the ring (based on the retry decision), correct? Is it OK this way? Maybe we need an additional custom retry policy that always retries on the same node. I'm not really sure, though - it might be the case that the first node is overloaded, and then it would make sense to try on another one.

@Lorak-mmk
Collaborator Author

So from my understanding, if the query fails, the retry might be sent to the next node in the ring (based on the retry decision), correct? Is it OK this way? Maybe we need an additional custom retry policy that always retries on the same node. I'm not really sure, though - it might be the case that the first node is overloaded, and then it would make sense to try on another one.

If we don't want to use the next node, it's enough to change fallback to be the same as pick: if it returns only one node, then no retry policy will be able to use another one (see the sketch below).
I don't have strong feelings on this matter. Currently all the nodes are up in tests, so it shouldn't matter.
I imagine in the future we could have tests that operate on a cluster with some nodes down, but then either:

  • the state of the cluster is stable (meaning the same nodes are down all the time), in which case it can be shared between tests, or
  • the test changes the state, in which case it must have exclusive access to the cluster.

In both cases the version I posted above works correctly.
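
For completeness, a sketch of the stricter variant discussed above (assuming the same policy as posted earlier): fallback, like pick, would only ever yield the first node, so no retry policy could move the query elsewhere.

    fn fallback<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> scylla::load_balancing::FallbackPlan<'a> {
        // Yield only the first node, exactly like `pick`, so retries
        // cannot be redirected to a different node.
        Box::new(
            cluster
                .get_nodes_info()
                .first()
                .map(|node| (node, Some(0)))
                .into_iter(),
        )
    }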

@muzarski
Contributor

muzarski commented Nov 20, 2024

Then I'm fine with the LBP you provided. It should significantly reduce the probability of this error's occurrence. If the error ever occurs in the future, we can adjust the LBP to return the same node in the fallback, as you mentioned (but for now, let's stick to the current version).

Lorak-mmk added a commit to Lorak-mmk/scylla-rust-driver that referenced this issue Nov 20, 2024
This commit changes all DDL queries to use helper functions introduced
in previous commit. Should fix scylladb#1126
Lorak-mmk added a commit to Lorak-mmk/scylla-rust-driver that referenced this issue Nov 21, 2024
This commit changes all DDL queries to use helper functions introduced
in previous commit. Should fix scylladb#1126