You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling $n$ elements from that distribution, removing the selected element from it each time. VarOptK doesn't have an obvious sequential analog, but it has the very nice property that the probability of an element being included in the sample will be proportional to that element's relative weight (weight_i / total_weights), if possible.
At the very least, I'd like to add two new methods to IteratorRandom: choose_multiple_weighted and choose_multiple_weighted_proportionally. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.
I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the Iterator trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existing IteratorRandom::choose_multiple the same treatment.
Details
I'd add a module rand::reservoir containing three types corresponding to these three algorithms, each with essentially the same API:
structReservoir<T>{/* ... */}impl<T>Reservoir<T>{fnwith_sample_size(n:usize) -> Self{/* ... */}fnfeed<R:Rng + ?Sized>(&mutself,rng:&mutR,item:T,weight:f64) -> ??? {// Should this simply return `bool` to indicate whether this item was selected?// Or a more complex type, like `Result<Option<T>, T>`, to return `Err(item)` if it wasn't selected,// and `Ok` with the potentially replaced item from the reservoir if it was?// Do we need a return value at all?}fncurrent_sample(&self) -> implIterator<Item = &T>{/* ... */}// Maybe also something like this:fnfeed_to_next_replacement<I,F>(&mutself,rng:&mutR,items:&mutI,weights:F) -> Option<T>whereR:Rng + ?Sized,I:Iterator<Item = T>,F:FnMut(&T) -> f64,{/* return `None` if `items` is exhausted before the sample is updated, `Some(replaced_item)` otherwise */}// Other convenience functions?}impl<T>IntoIteratorfor&Reservoir<T>{/* equivalent to `current_sample()` */}impl<T>IntoIteratorforReservoir<T>{/* consume the Reservoir to get an iterator of owned items */}impl<T>Into<Vec<T>>forReservoir<T>{/* equivalent to, but potentially more efficient than, `into_iter().collect()` */}// Other traits to implement?
Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of Uniform, Weighted, and ProportionallyWeighted, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.
I guess there's a case to be made for making a new trait ReservoirSampler (or similar) that all three structs would implement, and making the choose_multiple method generic over the reservoir type, but that seems over-engineered to me.
I think it's also worth discussing whether to also extend the SliceRandom trait. If I'm reading the implementation of SliceRandom::choose_multiple_weighted right, it's equivalent to slice.iter().choose_multiple_weighted(), so maybe it should also get an equivalent to slice.iter().choose_multiple_weighted_proportionally()? That doesn't seem necessary to me either, but users may expect that kind of symmetry.
I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.
The text was updated successfully, but these errors were encountered:
My apologies for not getting to this sooner. This is functionality it would be nice to see included somehow.
API organisation
We have the rand::seq::index module which is vaguely analogous (functionality related to implementing the methods on sequence traits). I could see rand::seq::reservoir living alongside this, although...
... given the amount of existing and proposed behaviour surrounding weighted sampling (WeightedIndex in rand, WeightedAliasIndex in rand_distr, IndexedRandom::choose_weighted and choose_multiple_weighted; this issue, #1053, and #1459), it may be time for a new rand_weighted crate or some such. Or maybe this is the wrong partitioning. @newpavlov? @vks?
Algorithms
We already have (I think) the A-res algorithm implemented (rand::seq::index::sample_efraimidis_spirakis). This could perhaps be made more generic to use arbitrary elements instead of an index.
Summary
Recently, while doing research for a blog post, I implemented two different schemes for weighted reservoir sampling: the A-Res algorithm by Efraimidis and Spirakis, and VarOptK by Cohen et al..
I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling$n$ elements from that distribution, removing the selected element from it each time. VarOptK doesn't have an obvious sequential analog, but it has the very nice property that the probability of an element being included in the sample will be proportional to that element's relative weight (
weight_i / total_weights
), if possible.At the very least, I'd like to add two new methods to
IteratorRandom
:choose_multiple_weighted
andchoose_multiple_weighted_proportionally
. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the
Iterator
trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existingIteratorRandom::choose_multiple
the same treatment.Details
I'd add a module
rand::reservoir
containing three types corresponding to these three algorithms, each with essentially the same API:Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of
Uniform
,Weighted
, andProportionallyWeighted
, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.Anyway, from this, we'd get:
Alternatives and Concluding Notes
I guess there's a case to be made for making a new trait
ReservoirSampler
(or similar) that all three structs would implement, and making thechoose_multiple
method generic over the reservoir type, but that seems over-engineered to me.I think it's also worth discussing whether to also extend the
SliceRandom
trait. If I'm reading the implementation ofSliceRandom::choose_multiple_weighted
right, it's equivalent toslice.iter().choose_multiple_weighted()
, so maybe it should also get an equivalent toslice.iter().choose_multiple_weighted_proportionally()
? That doesn't seem necessary to me either, but users may expect that kind of symmetry.I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.
The text was updated successfully, but these errors were encountered: