
CI: "group 0 concurrent modification" #1126

Closed
Lorak-mmk opened this issue Nov 20, 2024 · 5 comments · Fixed by #1127

@Lorak-mmk
Collaborator

New versions of Scylla use Raft for schema / topology operations.
Raft serializes those operations internally. Concurrent DDL operations are handled similarly to optimistic locking:
they basically work like "if the Raft state is X, change the schema accordingly (to state Y)".
If two such operations execute at the same time, one of them will fail (because the state is no longer X).
The Scylla server will restart such operations in that case, but the number of restarts is limited to 10.
If there are many concurrent DDL queries, it's possible to hit this limit, and it happens very often in our tests,
which makes CI unstable.

There are several possible solutions:

  • Limit test concurrency to 1. This makes the tests multiple times slower, so we should avoid it.
  • Have a global mutex for DDL changes.
  • Always send DDL changes to the same node. This works because Scylla handles this case internally.

The third solution should be the best performance-wise, because it doesn't introduce any additional locking and just lets Scylla handle this case.

The way to do this is to introduce a helper function in test utils that sends DDL queries.
This function would either:

  • take a mutex (second solution, sketched below), or
  • set a load balancing policy that sends the query to the correct node (third solution).
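
As a rough illustration, the mutex variant could look something like the sketch below. The helper name and the exact session call are placeholders, not the final test-utils API, and method names differ between driver versions.

use scylla::transport::errors::QueryError;
use scylla::Session;
use tokio::sync::Mutex;

// Global lock serializing all DDL statements issued by the test suite.
static DDL_LOCK: Mutex<()> = Mutex::const_new(());

// Hypothetical helper: tests would send schema changes through this function
// instead of calling the session directly.
async fn run_ddl(session: &Session, ddl: &str) -> Result<(), QueryError> {
    let _guard = DDL_LOCK.lock().await;
    // The session method name depends on the driver version
    // (`query` in older releases, `query_unpaged` in newer ones).
    session.query_unpaged(ddl, ()).await?;
    Ok(())
}
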
@Lorak-mmk Lorak-mmk added area/testing Related to tests - unit, integration, or even out of repo bug Something isn't working symptom/ci stability Related to continuous integration labels Nov 20, 2024
@muzarski
Contributor

The third option could also use #1031 once it's implemented.

@Lorak-mmk
Collaborator Author

The third option could also use #1031 once it's implemented.

I'm currently working on this issue. The way I see the LBP is this:

// This LBP produces a predictable query plan - it orders the nodes
// by their position in the ring.
// This is to make sure that all DDL queries land on the same node,
// to prevent errors from concurrent DDL queries executed on different nodes.
#[derive(Debug)]
struct SchemaQueriesLBP {}

impl LoadBalancingPolicy for SchemaQueriesLBP {
    fn pick<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> Option<(
        scylla::transport::NodeRef<'a>,
        Option<scylla::routing::Shard>,
    )> {
        // I'm not sure if Scylla can handle concurrent DDL queries to different shards,
        // in other words, whether its local lock is per-node or per-shard.
        // Just to be safe, let's use explicit shard.
        cluster.get_nodes_info().first().map(|node| (node, Some(0)))
    }

    fn fallback<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> scylla::load_balancing::FallbackPlan<'a> {
        Box::new(cluster.get_nodes_info().iter().map(|node| (node, Some(0))))
    }

    fn name(&self) -> String {
        todo!()
    }
}

WDYT @muzarski ?
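
For reference, a minimal sketch of how such a policy could be attached to DDL statements through a per-statement execution profile; the run_ddl helper is an assumption (not the actual test-utils code), and import paths and session method names may differ slightly between driver versions.

use std::sync::Arc;

use scylla::transport::errors::QueryError;
use scylla::{ExecutionProfile, Session};

// Hypothetical helper: wraps every DDL statement in an execution profile
// that uses SchemaQueriesLBP, so all schema changes are routed to one node.
async fn run_ddl(session: &Session, ddl: &str) -> Result<(), QueryError> {
    let handle = ExecutionProfile::builder()
        .load_balancing_policy(Arc::new(SchemaQueriesLBP {}))
        .build()
        .into_handle();

    let mut query = scylla::query::Query::new(ddl);
    query.set_execution_profile_handle(Some(handle));
    // The session method name depends on the driver version
    // (`query` in older releases, `query_unpaged` in newer ones).
    session.query_unpaged(query, ()).await?;
    Ok(())
}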

@Lorak-mmk Lorak-mmk self-assigned this Nov 20, 2024
@muzarski
Contributor

So from my understanding, if the query fails, the retry might be sent to the next node in the ring (based on the retry decision), correct? Is it OK this way? Maybe we need an additional custom retry policy that always retries on the same node. I'm not really sure, though - it might be the case that the first node is overloaded, and then it would make sense to try on another one.

@Lorak-mmk
Collaborator Author

So from my understanding, if the query fails, the retry might be sent to the next node in the ring (based on the retry decision), correct? Is it OK this way? Maybe we need an additional custom retry policy that always retries on the same node. I'm not really sure, though - it might be the case that the first node is overloaded, and then it would make sense to try on another one.

If we don't want to use the next node, it's enough to change fallback to be the same as pick: if it returns only one node, then no retry policy will be able to use another one (see the sketch below).
I don't have strong feelings on this matter. Currently all the nodes are up in tests, so it shouldn't matter.
I imagine in the future we could have tests that operate on a cluster with some nodes down, but then either:

  • the state of the cluster is stable (meaning the same nodes are down all the time), in which case it can be shared between tests, or
  • the test changes the state, in which case it must have exclusive access to the cluster.

In both cases the version I posted above works correctly.
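
For completeness, a sketch of the stricter variant discussed above (assuming the same policy as posted earlier): fallback, like pick, would only ever yield the first node, so no retry policy could move the query elsewhere.

    fn fallback<'a>(
        &'a self,
        _query: &'a scylla::load_balancing::RoutingInfo,
        cluster: &'a scylla::transport::ClusterData,
    ) -> scylla::load_balancing::FallbackPlan<'a> {
        // Yield only the first node, exactly like `pick`, so retries
        // cannot be redirected to a different node.
        Box::new(
            cluster
                .get_nodes_info()
                .first()
                .map(|node| (node, Some(0)))
                .into_iter(),
        )
    }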

@muzarski
Contributor

muzarski commented Nov 20, 2024

Then I'm fine with the LBP you provided. It should significantly reduce the probability of this error's occurrence. If the error ever occurs in the future, we can adjust the LBP to return the same node in the fallback, as you mentioned (but for now, let's stick to the current version).

Lorak-mmk added a commit to Lorak-mmk/scylla-rust-driver that referenced this issue Nov 20, 2024
This commit changes all DDL queries to use helper functions introduced
in previous commit. Should fix scylladb#1126
Lorak-mmk added a commit to Lorak-mmk/scylla-rust-driver that referenced this issue Nov 21, 2024
This commit changes all DDL queries to use helper functions introduced
in previous commit. Should fix scylladb#1126