Stateful partitioners #758

Merged: 14 commits from the stateful-partitioners branch into scylladb:main on Aug 22, 2023

Conversation

@wprzytula (Collaborator) commented on Jul 7, 2023

Motivation

Currently, partitioners require the whole input to be passed in a single contiguous slice, which often forces an allocation that could otherwise be avoided.

What's done

  1. The partitioners' API is refactored so that it resembles the std::hash API (see the sketch after this list):
    • Partitioner corresponds to BuildHasher, as it becomes a kind of factory,
    • PartitionerHasher is introduced and corresponds to Hasher. It holds the hashing state, which enables feeding it with multiple chunks of data separately.
      A test is added that ensures a consistent hashing result no matter how the partition key is split into chunks.
  2. After the partitioners gain the ability to consume the partition key in chunks, the codebase is modified so that all needless allocations of the partition key are avoided, i.e., allocations that only served the purpose of calculating the token.
    To this end, the PartitionKey struct is introduced, whose motivation is as follows: the algorithm for computing the partition key for a given prepared statement and bound values can be divided into steps. The first of them involves extracting the values that constitute the partition key and putting them in proper partition key order. PartitionKey performs this step on construction and serves as an entry point for further steps, token calculation among others. For those further steps, an iterator over partition key values is accessible, which yields (value, spec) pairs for each column that constitutes the partition key, in partition key order (see the second sketch after this list).
  3. As there was one more place where a materialised partition key was used - decoding it into the produced trace - an adaptor is introduced that operates on PartitionKey's iterator and provides values to be deserialised by the mechanism recently added in "session: include partition key in RequestSpan in human readable form" (#766). An allocation is saved there, too.
  4. An API cleanup is done regarding partition key and token computation. calculate_token_for_partition_key() is moved to partitioner.rs, calculate_token() to prepared_statement.rs, and a number of helper functions are deleted as no longer needed.
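A minimal sketch of the trait shape and intended usage described in point 1 (the Token placeholder, the default body of hash_one(), and the helper token_of() are illustrative assumptions, not the driver's exact API):

```rust
// Sketch only: trait and method names follow the PR description and commit
// messages; the Token struct and the default hash_one() body are assumptions.
pub struct Token {
    pub value: i64,
}

/// Factory for stateful hashers, analogous to `std::hash::BuildHasher`.
pub trait Partitioner {
    type Hasher: PartitionerHasher;

    /// Creates a fresh hasher holding the hashing state.
    fn build_hasher(&self) -> Self::Hasher;

    /// Convenience for hashing a single contiguous slice,
    /// analogous to `BuildHasher::hash_one`.
    fn hash_one(&self, data: &[u8]) -> Token {
        let mut hasher = self.build_hasher();
        hasher.write(data);
        hasher.finish()
    }
}

/// Stateful hasher, analogous to `std::hash::Hasher`: data can be fed
/// in chunks, and `finish()` yields the token computed from what has
/// been fed so far.
pub trait PartitionerHasher {
    fn write(&mut self, pk_part: &[u8]);
    fn finish(&self) -> Token;
}

/// Intended usage: feed the serialized partition key chunk by chunk,
/// without materialising it in a single buffer first.
pub fn token_of<P: Partitioner>(partitioner: &P, pk_chunks: &[&[u8]]) -> Token {
    let mut hasher = partitioner.build_hasher();
    for chunk in pk_chunks {
        hasher.write(chunk);
    }
    hasher.finish()
}
```

And a hypothetical shape for the PartitionKey introduced in point 2 (only the (value, spec) item type of the iterator comes from this PR; the field, the stand-in ColumnSpec, and the method body are assumptions):

```rust
// Hypothetical sketch of PartitionKey (point 2); only the (value, spec)
// item shape of the iterator is taken from the PR, the rest is assumed.
pub struct ColumnSpec; // stand-in for the driver's column metadata type

pub struct PartitionKey<'ps> {
    // Values constituting the partition key, reordered into partition key
    // order on construction.
    values_in_pk_order: Vec<(&'ps [u8], &'ps ColumnSpec)>,
}

impl<'ps> PartitionKey<'ps> {
    /// Yields a (value, spec) pair per partition key column, in partition
    /// key order; token calculation and trace decoding both consume this.
    pub fn iter(&self) -> impl Iterator<Item = (&'ps [u8], &'ps ColumnSpec)> + '_ {
        self.values_in_pk_order.iter().copied()
    }
}
```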

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • [ ] I added appropriate Fixes: annotations to PR description.

@wprzytula requested review from piodul and cvybhu on July 7, 2023 15:58
@mykaul (Contributor) commented on Jul 9, 2023

Any measurable performance impact?

@avelanarius left a comment

As part of this change, you should also tackle the issues with the public API for calculating tokens.

Currently, there are some problems with that: there are 3 functions to calculate a Token: Session::calculate_token, Session::calculate_token_for_partition_key and ClusterData::compute_token. It doesn't make sense to put them inside Session or ClusterData - wrong level of abstraction. The same applies to the functions that calculate the partition key.

More context: #757 (comment)
My initial exploration on fixing it: https://github.com/avelanarius/scylla-rust-driver/tree/token_calc_cleanup

@wprzytula (Collaborator, Author)

Any measurable performance impact?

Number of allocations has dropped (6 -> 5 per insert, 19 -> 18 per select).

Before:

Args { mode: All, node: "127.0.0.1:9042", parallelism: 256, requests: 100000 }
Connecting to 127.0.0.1:9042 ...
Sending 100000 inserts, hold tight ..........
----------
Inserts:
----------
allocs/req:                 6.08
reallocs/req:               6.00
frees/req:                  6.07
bytes allocated/req:      272.75
bytes reallocated/req:     73.06
bytes freed/req:          267.77
----------
Sending 100000 selects, hold tight ..........
----------
Selects:
----------
allocs/req:                19.00
reallocs/req:               6.00
frees/req:                 19.00
bytes allocated/req:     1123.09
bytes reallocated/req:     73.00
bytes freed/req:         1123.03
----------

After:

Args { mode: All, node: "127.0.0.1:9042", parallelism: 256, requests: 100000 }
Connecting to 127.0.0.1:9042 ...
Sending 100000 inserts, hold tight ..........
----------
Inserts:
----------
allocs/req:                 5.08
reallocs/req:               5.00
frees/req:                  5.07
bytes allocated/req:      265.09
bytes reallocated/req:     77.06
bytes freed/req:          260.06
----------
Sending 100000 selects, hold tight ..........
----------
Selects:
----------
allocs/req:                18.00
reallocs/req:               5.00
frees/req:                 18.00
bytes allocated/req:     1115.19
bytes reallocated/req:     77.00
bytes freed/req:         1115.05
----------

@piodul (Collaborator) commented on Jul 24, 2023

Any measurable performance impact?

Number of allocations has dropped (6 -> 5 per insert, 19 -> 18 per select).

Please prepare a microbenchmark for this. While an allocation is now avoided, the new implementation has more branches, so it will be interesting to see whether the tradeoff was worth it.

We are basically interested in the PreparedStatement::compute_partition_key function. Interesting cases are single-column pks with different lengths and multi-column pks with different lengths.

We already have a small amount of benchmarks in scylla and scylla-cql, you can look at them for reference (and at the external criterion crate that those benchmarks use).
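
Such a microbenchmark might look roughly like the sketch below (criterion's API is real; the compute_token_for placeholder and the benchmark names are made up and merely stand in for the partition key extraction and token computation under test):

```rust
// Hypothetical criterion microbenchmark skeleton; everything except
// criterion's own API (bench_function, criterion_group!, criterion_main!)
// is a placeholder standing in for the routine under test.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder for partition key extraction + token computation.
fn compute_token_for(pk_columns: &[&[u8]]) -> i64 {
    pk_columns.iter().map(|c| c.len() as i64).sum()
}

fn partitioner_benches(c: &mut Criterion) {
    let long = vec![0u8; 1024];

    // Single-column partition keys of different lengths.
    c.bench_function("single-column pk, short", |b| {
        b.iter(|| compute_token_for(black_box(&[&b"abc"[..]])))
    });
    c.bench_function("single-column pk, long", |b| {
        b.iter(|| compute_token_for(black_box(&[long.as_slice()])))
    });
    // A multi-column partition key.
    c.bench_function("multi-column pk", |b| {
        b.iter(|| compute_token_for(black_box(&[&b"abc"[..], &b"defgh"[..], &b"ij"[..]])))
    });
}

criterion_group!(benches, partitioner_benches);
criterion_main!(benches);
```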

@wprzytula (Collaborator, Author)

Please prepare a microbenchmark for this. While an allocation is now avoided, the new implementation has more branches, so it will be interesting to see whether the tradeoff was worth it.

After benchmarks and some changes to the algorithm, the new way seems to outperform the old way by 15% to 40% in various cases. I believe this suffices to merge this.

@mykaul (Contributor) commented on Jul 25, 2023

Please prepare a microbenchmark for this. While an allocation is now avoided, the new implementation has more branches, so it will be interesting to see whether the tradeoff was worth it.

After benchmarks and some changes to the algorithm, the new way seems to outperform the old way by 15% to 40% in various cases. I believe this suffices to merge this.

NICE! Well done! Would be happy to see the details of these tests.

@wprzytula (Collaborator, Author)

Please prepare a microbenchmark for this. While an allocation is now avoided, the new implementation has more branches, so it will be interesting to see whether the tradeoff was worth it.

After benchmarks and some changes to the algorithm, the new way seems to outperform the old way by 15% to 40% in various cases. I believe this suffices to merge this.

NICE! Well done! Would be happy to see the details of these tests.

I'll clean the benchmarks up and include them in our codebase, so they can guard against unintended performance regressions in the future.

@wprzytula (Collaborator, Author)

These are the results of the benchmarks: the old algorithm on the right (percentage not relevant), the new algorithm on the left (percentage shows improvement relative to the old one).
[Screenshot: benchmark results, from 2023-07-25 16-40-08]

@wprzytula (Collaborator, Author)

CI fails due to use of Option::unzip(), which is stable since Rust 1.66, whereas our MSRV is 1.65. Can we bump it @piodul?
Apart from that, ready for review.

@wprzytula added this to the 0.10.0 milestone on Jul 28, 2023
@wprzytula added the "performance" label (Improves performance of existing features) on Jul 30, 2023
@wprzytula force-pushed the stateful-partitioners branch 2 times, most recently from 0aa8953 to 286044f on July 31, 2023 12:59
@wprzytula (Collaborator, Author)

I reimplemented Option::unzip() temporarily (technically, copied its code from the standard library), so as not to force an MSRV bump yet.
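
A sketch of that temporary helper, assuming the signature mirrors Option::unzip() (the commit messages below name it unzip_option()):

```rust
// Sketch of the temporary helper; its body mirrors what std's
// Option::unzip() does, which is stable only since Rust 1.66.
pub fn unzip_option<A, B>(opt: Option<(A, B)>) -> (Option<A>, Option<B>) {
    match opt {
        Some((a, b)) => (Some(a), Some(b)),
        None => (None, None),
    }
}
```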

@wprzytula (Collaborator, Author)

Rebased on main.

frame::types,
frame::value::ValueList,
transport::partitioner::{calculate_token_for_partition_key, Murmur3Partitioner},
};
A Collaborator commented on the hunk above:
Nit: putting the commit with the benchmark at the beginning of the PR should make it easier to use it to compare performance before and after the changes.

@piodul (Collaborator) left a comment

Another batch of comments - hopefully the last one this time.

/// Instances of this trait are created by a `Partitioner` and are stateful.
/// At any point, one can call `finish()` and a `Token` will be computed
/// based on values that have been fed so far.
pub trait PartitionerHasher {
A Collaborator commented:

Nit: The first thing that comes to my mind when I see PartitionerHasher is that it hashes partitioners. I don't have a good alternative suggestion, though.

prepared_metadata: &PreparedMetadata,
partition_key: Option<&Bytes>,
pub(crate) fn new_prepared<'ps>(
partition_key: Option<impl Iterator<Item = (&'ps [u8], &'ps ColumnSpec)> + Clone>,
A Collaborator commented:

Does this need to be this generic? Does it really make sense to pass anything here that isn't an Option<PartitionKey>? Perhaps doing so would make it easier to fix the lifetime issue in partition_key_displayer.

@wprzytula (Collaborator, Author) replied:

I tried it and gave up. Apparently, the Iterator returned by PartitionKey::iter() is less problematic in terms of lifetime issues than PartitionKey itself.

@wprzytula requested a review from piodul on August 21, 2023 08:54
An `else` branch is introduced to make code flow clearer, and a typo is
fixed.
The Partitioner API is changed into a stateful one, resembling the
std::hash traits. The motivation is simple: having Partitioners hold
state enables feeding them with data in chunks instead of one contiguous
slice, which saves us an allocation.
From now on, Partitioners' users are encouraged to first build a hasher
using `Partitioner::build_hasher()` and then feed it in chunks with
`write()`. Finally, `finish()` is to be called and a token is returned.
The `Partitioner::hash()` method was renamed to `hash_one()` for closer
correspondence to the `std::hash::BuildHasher` trait.
This is an umbrella enum for all available PartitionerHashers. It is
analogous to PartitionerName in this regard, as well as when it comes to
its usage.
In the next few commits, the logic concerning partition key and token
computation will be split into two steps:
1) extracting the partition key from serialized values
2) calculating the token based on the extracted partition key

Therefore, to gain more granularity over errors, `PartitionKeyError` is
divided into `PartitionKeyExtractionError` and `TokenCalculationError`.
This will be needed for clean error handling in the new API.
The recently added `calculate_token_for_partition_key()` duplicates the
logic that `ClusterData::compute_token()` uses. To deduplicate,
`compute_token()` now calls `calculate_token_for_partition_key()`.
It is clear that the algorithm for computing the partition key for a
given prepared statement and bound values can be divided into steps. The
first of them involves extracting the values that constitute the
partition key and putting them in proper partition key order. The
PartitionKey struct performs this step on construction and serves as an
entry point for further steps, token calculation among others. For those
further steps, an iterator over partition key values is accessible,
which yields (value, spec) pairs for each column that constitutes the
partition key, in partition key order. It will be used to avoid
materialising the partition key in a buffer.
The tests require PartialEq, Eq on TableSpec and ColumnSpec, so these
traits were derived unconditionally.
Traces generated upon requests are expected to contain the partition key
if available. In order to avoid an allocation for that, the `PartitionKey`
struct is used, whose iterator yields the required values one by one;
they are then deserialised.
The partition key is extracted only once, and the token is calculated
only once, too.

To avoid raising our MSRV, `Option::unzip()` is temporarily reimplemented
as a utility function, `unzip_option()`. It is to be replaced with
`Option::unzip()` once the MSRV is raised to at least 1.66.
It's no longer useful, as we now operate directly on serialized values
before they get encoded in the partition key format.
`Session::calculate_token()` is modified so that it takes advantage of
the new Partitioner API and hence avoids allocating the whole partition
key. Additionally, `calculate_token()` is now called directly for
queries, instead of `calculate_partition_key()` being called first.

As a bonus, the mess around various flavours of
`[calculate|compute]_partition_key` functions is finally gone: two of
them could be completely removed.
It has no usages left, as `build_hasher()` is to be used now.
It should be made clear that calculating a token without materialising
the partition key is now feasible, and this approach should be promoted.
In an effort to avoid overloading Session with unrelated functionality,
token computation functions are moved out of session.rs. Specifically,
`calculate_token()` is moved to prepared.rs and
`calculate_token_for_partition_key()` is moved to partitioner.rs.
The benchmark's results confirm that the new algorithm for stateful
partitioners is more performant than the old one, by about 15% to 40%.

In the future, the benchmark can be run to prevent unintended
performance regressions in partitioners.
@piodul merged commit 83f4050 into scylladb:main on Aug 22, 2023
8 checks passed
@wprzytula deleted the stateful-partitioners branch on March 5, 2024 13:05