Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partitioningIndex(where:) for Sorted Set/Dictionary #256

Open
dabrahams opened this issue Dec 21, 2022 · 11 comments
Open

partitioningIndex(where:) for Sorted Set/Dictionary #256

dabrahams opened this issue Dec 21, 2022 · 11 comments
Labels
enhancement New feature or request SortedCollections
Milestone

Comments

@dabrahams
Copy link

dabrahams commented Dec 21, 2022

I realize the docs say this code is still unstable, but it's never too early to ask for more, right? 😉

There should be something analogous to partitioningIndex(where:) for the sorted associative containers. An example of why: I have a multimap K:V where the chance of having multiple V's for any K is very low and for any K, all Vs are unique. The best storage format is roughly as a sorted set of lexicographically ordered (K,V) pairs. There should be a way to find the beginning and end of the sequence of contiguous items with a given K value, without (obviously) the cost of treating the tree as a flat collection and using indices.

In fact I'm very surprised not to see something like this in the BTree source (though it's so spread out, I may have missed it); it seems like contains and perhaps a few other things might be able to be built upon it. Thanks!

@dabrahams dabrahams added the enhancement New feature or request label Dec 21, 2022
@lorentey
Copy link
Member

lorentey commented Dec 23, 2022

Absolutely! This is one area where we're currently missing key functionality, and this will definitely need to be addressed before shipping these types.

I've grown to really like using a subscript to allow easy lookups for any range of items in sorted collections. For example, BitSet is a sorted collection of nonnegative integers, represented as a simple bit vector, and it provides this memberwise slicing operation:

let bits: BitSet = [2, 5, 6, 8, 9]
let a = bits[members: 3..<7] // [5, 6]
let b = bits[members: 4...] // [5, 6, 8, 9]
let c = bits[members: ..<8] // [2, 5, 6]

(https://github.com/apple/swift-collections/blob/main/Sources/BitCollections/BitSet/BitSet%2BExtras.swift#L81-L121)

BitSet is approximating production readiness now -- I'm reasonably confident that its implementation is korrekt, but it's still pending API review. I'll be most excited to collect feedback on these parts, as I expect BitSet will set the template for every other sorted collection, including the (far more exciting) SortedSet/Dictionary. I expect to run its review early next year, alongside its lesser counterpart BitArray.

Generalizing this operation for sorted [multi]dictionaries, dict[keys: k...k] would be the idiomatic way to find all the key-value pairs with the key k. (With something like dict.values[keys: k...k] as a shortcut to get just the values.)

These subscripts may also provide a reason to introduce <.. and <..< (<.<?) operators -- despite my reluctance to embrace them.

(Edit: in fact, for a true multidictionary, we probably want the core dict[key] operation to directly return something functionally similar to a slice of the values view, except ideally we would want it to be range-replaceable. Whether that'll be practical given the limitations around in-place mutable slices remains to be seen; however, noncopyable types may open some new design directions.)

@lorentey lorentey added this to the 1.2.0 milestone Dec 23, 2022
@dabrahams
Copy link
Author

Yes, subscripts make sense for any kind of projection, and slicing is a good example. I suppose you're aware that if your BitSet simply used Ints for an Index type, you could do all of those with the regular slicing subscript? I've been working with my own implementation, FWIW.

Generalizing this operation for sorted [multi]dictionaries, dict[keys: k...k] would be the idiomatic way to find all the key-value pairs with the key k.

Sure, but please don't neglect to expose the fundamental partition point-finding operation on which that would be based!

BTW, I hope you're aware of https://github.com/loftware/StandardLibraryProtocolChecks which can be useful for your test coverage.

@lorentey
Copy link
Member

I suppose you're aware that if your BitSet simply used Ints for an Index type, you could do all of those with the regular slicing subscript?

Yes, and that's exactly how indices work in BitArray, but I think doing that would be too misleading for BitSet, whose indices address only the set bits.

The guideline I'm strictly enforcing is to never use a Strideable type for Index unless its advanced(by:) operation is equivalent to the collection's index(_:offsetBy:) method (ignoring any bounds checking that may be done by the latter). Introducing a trivial wrapper costs nothing, but it eliminates a common error.

I've been working with my own implementation, FWIW.

Interesting! I'm also so very tempted to make BitSet (and Deque, OrderedSet etc) generic over the underlying storage type -- the current implementations are hardwiring these, which crucially limits their usefulness. The thing that holds me back is that having to spell the type as, say, BitSet<ContiguousArray<UInt>> would be far too pedantic for simple use cases.

Perhaps this would be more viable if we could set default values for generic parameters. Failing that, my plan is to eventually add the fully generic types under different names, and to change the current types to be simple wrappers (or even just typealiases).

@lorentey
Copy link
Member

Sure, but please don't neglect to expose the fundamental partition point-finding operation on which that would be based!

Do you think we need to explicitly expose that as a core operation? (Isn't it generally equivalent to the proposed set[members: k...]?)

@lorentey
Copy link
Member

BTW, I hope you're aware of https://github.com/loftware/StandardLibraryProtocolChecks which can be useful for your test coverage.

I am! 👍 Have you seen _CollectionsTestSupport? It operates in the same area, although it remains a hidden implementation detail, rather than a widely reusable component. This is such an obviously fundamental thing, it ought to be a project-level effort. So many things to work on!

@dabrahams
Copy link
Author

Yeah, I did see that. I was wondering, what's the point of MinimalTypes?

@dabrahams
Copy link
Author

Do you think we need to explicitly expose that as a core operation? (Isn't it generally equivalent to the proposed set[members: k...]?)

I kinda think you do, especially if you're going to have insert-with-hint functionality, and especially if indices are going to continue to contain an array. It's a matter of exposing the basis operations of the datatype in a way that they can easily be known, and without loss of efficiency.

@lorentey
Copy link
Member

lorentey commented Dec 24, 2022

Yeah, I did see that. I was wondering, what's the point of MinimalTypes?

Those are reusable types with deliberately over-pedantic (and/or unusually configurable) conformances that can be used to exercise commonly forgotten code paths in functions & types that are generic over the corresponding protocols. The Minimal prefix is not necessarily the most correct way to name these, but it works, and it matches the stdlib test suite, which is where these loosely come from.

The package tests don't often use these at the moment, other than MinimalDecoder/MinimalEncoder, which I find to be the least painful (if incomplete) way to test Codable conformances. MinimalSequence/MinimalCollection are also used in spots to better test member functions that need to take arbitrary sequences/collections. (Most of the rest I originally used to test the initial Deque/OrderedSet designs, where the storage type came from a generic type parameter -- those aren't used in the current package, but I expect this isn't going to remain like that forever.)

@lorentey
Copy link
Member

lorentey commented Dec 24, 2022

Do you think we need to explicitly expose that as a core operation? (Isn't it generally equivalent to the proposed set[members: k...]?)

I kinda think you do, especially if you're going to have insert-with-hint functionality, and especially if indices are going to continue to contain an array. It's a matter of exposing the basis operations of the datatype in a way that they can easily be known, and without loss of efficiency.

Hm; I do expect that indices will continue to encode a path in the tree, although the implementation is not final. For indexing to be efficient in high-fanout trees, indices need to also contain an unsafe/unowned reference to the final node on the path, which has some unfortunate consequences -- such as that mutations need to invalidate every index. This makes indices fragile, fleeting things that tend to fall apart at the slightest change.

That said, I can't say I get how the index representation would matter when it comes to expressing the core lookup operation. Can you expand a little bit on what you mean?

I don't think spelling the partitioning operation as, say, d[members: k...].startIndex would necessarily result in any measurable loss of efficiency. I'm not opposed to adding dedicated member(s) if it turns out it does -- however, I also do not currently consider this particular operation to be important enough to deserve its own dedicated spelling. Importantly, my feeling so far is that range expressions involving members are more intuitively readable than something like, say, partitioningIndex(of:). (Even more so if we got rid of the subscript label, as I hope we'll be able to do. I find it difficult to beat the sheer obviousness of d[k...].startIndex as the partitioning syntax.)

I'm very open to getting convinced, though!

(I wasn't planning to have an insert-with-hint in the initial release. Something more like the old cursor construct may be something to consider later, though. A hint seems far easier to validate if it sits beside the collection that produced it, in a mutable construct that owns and maintains both. Additionally, allowing the mutator construct to loosen some collection invariants (such as the tree's augmentation with subtree counts) can be a crucial performance boost when applying a batch of updates.)

@dabrahams
Copy link
Author

Have you seen _CollectionsTestSupport? It operates in the same area, although it remains a hidden implementation detail, rather than a widely reusable component. This is such an obviously fundamental thing, it ought to be a project-level effort.

While there's a little bit of overlap, I think there's quite a bit that's distinct. You should steal that code and feel free to turn it into whatever you want. Moreover it looks like the collection adapters in the SwiftAlgorithms collection don't have any testing in the same area.

@dabrahams
Copy link
Author

That said, I can't say I get how the index representation would matter when it comes to expressing the core lookup operation. Can you expand a little bit on what you mean?

It's just that the range you return from that operation is going to contain two indices instead of just one. it's a bit more overhead.

I don't think spelling the partitioning operation as, say, d[members: k...].startIndex would necessarily result in any measurable loss of efficiency. I'm not opposed to adding dedicated member(s) if it turns out it does -- however, I also do not currently consider this particular operation to be important enough to deserve its own dedicated spelling.

That's where I disagree with you. The principle of generic programming I'm upholding is that when you discover the fundamental building blocks of useful operations/components, you expose them, because then other high-performance components can be built on top.

Importantly, my feeling so far is that range expressions involving members are more intuitively readable than something like, say, partitioningIndex(of:). (Even more so if we got rid of the subscript label, as I hope we'll be able to do. I find it difficult to beat the sheer obviousness of d[k...].startIndex as the partitioning syntax.)

or d[..<k].endIndex. Now there are two equally valid spellings for the same fundamental operation. Furthermore, you've totally lost the semantic connection to the algorithm that operates on a general collection that is doing the same thing. BTW, I loathe the name partitioningIndex(where:), and the only reason I suggested it was that's the name being used in SwiftAlgorithms. My point here is that there should be a consistent spelling.

(I wasn't planning to have an insert-with-hint in the initial release. Something more like the old cursor construct may be something to consider later, though. A hint seems far easier to validate if it sits beside the collection that produced it, in a mutable construct that owns and maintains both. Additionally, allowing the mutator construct to loosen some collection invariants (such as the tree's augmentation with subtree counts) can be a crucial performance boost when applying a batch of updates.)

I don't know, I'm very much regretting the comparability requirement for indices and a few other things. Looking forward to greenfield stdlib development with Val so I can make new/different mistakes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SortedCollections
Projects
None yet
Development

No branches or pull requests

2 participants