Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Add System.Collections.Concurrent.ConcurrentSet<T> #110825

Open
MapleWheels opened this issue Dec 18, 2024 · 19 comments
Open

[API Proposal]: Add System.Collections.Concurrent.ConcurrentSet<T> #110825

MapleWheels opened this issue Dec 18, 2024 · 19 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Milestone

Comments

@MapleWheels
Copy link

MapleWheels commented Dec 18, 2024

Background and motivation

Copying this issue but with an addition to the type's use case.

I didn't see any open issues on this repo for it. If there are, sorry in advance.

This is something I was searching for when I needed to create a collection with near O(1) lookup of combined key-values, where the struct itself contains the key and value data, this is very useful in both resource lookups with custom Multi-Key (OR compare logic against select members) types for dictionaries.

This is very useful for in-memory resources lookups from disk-loaded or network-retrieved assets. See below for an example of a sample based on our usage. The implementation is similar to the previous proposal but with the addition of some new functions to justify it's use.

This should already be a thing, because .NET already implements this in the default HashSet<T> with my exact justification in the remarks: source

Image

API Proposal

Edit: removed IEqualityComparer<T> constraint.

Proposal for modified continuation of implementation of: ConcurrentSet<T>.

namespace System.Collections.Concurrent;

public class ConcurrentSet<T> : System.Collections.Generic.ICollection<T>, System.Collections.Generic.IEnumerable<T>, System.Collections.Generic.IReadOnlyCollection<T>, System.Collections.ICollection,System.Collections.IEnumerable 
where T : notnull
{
    // partial API additions to: https://github.com/dotnet/runtime/issues/39919
    
    /// <summary>
    /// Adds the given value to, or replaces an existing value with the same equality in, the set.
    /// </summary>
    /// <param name="newValue">The new value to replace.</param>
    /// <returns></returns>
    public void AddOrReplace(T newValue){ /* */}
}

API Usage

// data  class
public readonly record struct AssemblyStringResource : IEqualityComparer<AssemblyStringResource>
{
    public readonly string AssemblyName;
    public readonly Assembly Assembly;

    // other data not keyed, could be MetaRefs, etc.
    public readonly int SomeData1;
    public readonly float SomeData2;
    public readonly bool SomeData3;
    
    public readonly int HashCode;

    // data struct creation
    public AssemblyStringResource(Assembly assembly, int someData1, float someData2, bool someData3)
    {
        Assembly = assembly;
        AssemblyName = assembly.GetName().Name; 
        // just using the name string for the sake of example
        HashCode = AssemblyName!.GetHashCode(); 
        SomeData1 = someData1;
        SomeData2 = someData2;
        SomeData3 = someData3;
    }

    // key ctor: Assembly
    public AssemblyStringResource(Assembly assembly)
    {
        if (assembly == null)
            throw new ArgumentNullException(nameof(assembly));
        Assembly = assembly;
        AssemblyName = assembly.GetName().Name;
        // just using the name string for the sake of example
        HashCode = AssemblyName!.GetHashCode(); 
    }

    // key ctor: AssemblyName
    public AssemblyStringResource(string assemblyName)
    {
        if (assemblyName.IsNullOrEmpty())
            throw new ArgumentNullException(nameof(assemblyName));
        AssemblyName = assemblyName;
        // just using the name string for the sake of example
        HashCode = AssemblyName!.GetHashCode(); 
        Assembly = null;
    }
    
    public static implicit operator AssemblyStringResource(Assembly assembly) => new AssemblyStringResource(assembly);
    public static implicit operator AssemblyStringResource(string assemblyName) => new AssemblyStringResource(assemblyName);

    public bool Equals(AssemblyStringResource x, AssemblyStringResource y)
    {
        if (x == y)
            return true;
        if (x.Assembly is not null && y.Assembly is not null)
            return x.Assembly == y.Assembly;
        if (!x.AssemblyName.IsNullOrEmpty() && y.AssemblyName.IsNullOrEmpty())
            return x.AssemblyName == y.AssemblyName;
        return false;
    }

    public int GetHashCode(AssemblyStringResource obj)
    {
        return obj.HashCode;
    }

    public override int GetHashCode()
    {
        return GetHashCode(this);
    }

    public override string ToString()
    {
        return AssemblyName;
    }

    public bool Equals(AssemblyStringResource? other)
    {
        return Equals(this, other);
    }
}


// example usage:

public interface IAssemblyService
{
    bool IsAssemblyLoaded(string assemblyName);
    IEnumerable<Type> GetTypesInAssembly(string assemblyname);
}

internal interface IIIInternalAssemblyService
{
    bool TryGetAssemblyRes(Assembly assembly, out AssemblyStringResource? res);
    bool TryGetAssemblyRes(string assembly, out AssemblyStringResource? res);
    void UpdateAssemblyList(AssemblyStringResource res);
}


// I know this isn't proper code, it's just an example :)
public class AssemblyService :  IAssemblyService, IIIInternalAssemblyService
{
    private readonly ConcurrentSet<AssemblyStringResource> _assemblyResources = new();

    public AssemblyService(IResourceService service)
    {
        foreach (var res in service.GetAssemblyResources())
        {
            _assemblyResources.Add(res); // populate
        }
    }

    bool IAssemblyService.IsAssemblyLoaded(string assemblyName)
    {
        return _assemblyResources.Contains(assemblyName); // implicit conversion search
    }

    IEnumerable<Type> IAssemblyService.GetTypesInAssembly(string assemblyName)
    {
        if (!_assemblyResources.TryGetValue(assemblyName, out var res)
            return null;
        if (res.Assembly is null)
        {
            return null;
        }
        return Assembly.GetTypes();
    }

    bool IIIInternalAssemblyService.TryGetAssemblyRes(Assembly assembly, out AssemblyStringResource? res)
    {
        if (!_assemblyResources.TryGetValue(assembly, out var res)
        {
            res = null;
            return false;
        }
        return res;
    }
    bool IIIInternalAssemblyService.TryGetAssemblyRes(string assembly, out AssemblyStringResource? res)
    {
        if (!_assemblyResources.TryGetValue(assembly, out var res)
        {
            res = null;
            return false;
        }
        return res;
    }

    void IIIInternalAssemblyService.UpdateAssemblyList(AssemblyStringResource res)
    {
         _assemblyResources.AddOrReplace(res);
    }
}

Alternative Designs

No response

Risks

The breaking change would be the addition of AddOrReplace() to GenericICollection<T> and other related interfaces.

@MapleWheels MapleWheels added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 18, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 18, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 18, 2024
@teo-tsirpanis teo-tsirpanis added area-System.Collections and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Dec 18, 2024
@KalleOlaviNiemitalo
Copy link

type T should implement IEqualityComparer<T>

That would be weird. The convention for hashed collections is that equality comparisons and hash codes are provided by one of:

  • IEqualityComparer<T> given as a parameter to the collection constructor
  • IEquatable<T> if T implements that
  • overrides of Object.GetHashCode() and Object.Equals(Object)

@ovska
Copy link

ovska commented Dec 19, 2024

Could your use case be satisfied with ConcurrentDictionary<T, T>, using dict[obj] = obj instead of AddOrReplace

@MapleWheels
Copy link
Author

MapleWheels commented Dec 19, 2024

type T should implement IEqualityComparer<T>

That would be weird. The convention for hashed collections is that equality comparisons and hash codes are provided by one of:

  • IEqualityComparer<T> given as a parameter to the collection constructor
  • IEquatable<T> if T implements that
  • overrides of Object.GetHashCode() and Object.Equals(Object)

I honestly didn't think about that point enough, good catch. This serves the intended purpose regardless. Edited the original comment.

@MapleWheels
Copy link
Author

MapleWheels commented Dec 19, 2024

Could your use case be satisfied with ConcurrentDictionary<T, T>, using dict[obj] = obj instead of AddOrReplace

Semantically, yes if this was just a HashSet underneath that implemented IDictionary<T,T> with a bonded Key-Value object (the value assignment also updates the key) but in implementation/as a substitute to ConcurrentSet, no because this results in a duplication of the structs. The example I used is simple but in more complex cases this would add a lot of memory overhead and potentially desync between the key and stored value without boilerplate code.

@eiriktsarpalis
Copy link
Member

This seems related to #21603 (comment). TL;DR we've been resisting the inclusion of methods like GetOrAdd or AddOrReplace in set-like collections, since we don't want to encourage modelling of types that are equal but contain conflicting information.

Apart from that, I don't believe much has changed from when #39919 got closed. Even if it was implemented as a wrapper over ConcurrentDictionary it would be creating a substantial yet trivial new API surface.

@MapleWheels
Copy link
Author

MapleWheels commented Dec 19, 2024

This seems related to #21603 (comment). TL;DR we've been resisting the inclusion of methods like GetOrAdd or AddOrReplace in set-like collections, since we don't want to encourage modelling of types that are equal but contain conflicting information.

Apart from that, I don't believe much has changed from when #39919 got closed. Even if it was implemented as a wrapper over ConcurrentDictionary it would be creating a substantial yet trivial new API surface.

I read through a large chunk of the discussion and I mostly agree with @theodorzoulias .

I can understand not adding more definitions. I would be satisfied with a concurrent TryGetValue() and can just put my own boilerplate on top as needed (although a replace would be nice).

The reason this is not a trivial API surface is Databases. What this API proposal fundamentally serves is having data sets with unique keys, foreign keys and/or compound keys defined within the data struct. This type of lookup is already commonplace! However, .NET doesn't support it for some reason.

The original authors for the HasSet API adding TryGetValue did not make a mistake. It was a thoughtful decision to implement a collection that supported developer-designed complex lookups on a collection without data duplication/desync (Dictionary KVP) without having O(n) lookup time.

Look, being frank. Are you suggesting that we all just use ConcurrentBag and live with O(n), or conversely use ConcurrentDictionary<T,T> with developer-made boilerplate and have double the memory footprint for value keys? Implementing a custom key is a recipe for data desync between the key and the value stored, among other things.

What is the option we're supposed to use for a concurrent-safe, multi-key/compound-key query on a collection? Because right now, there isn't one outside of me maintaning a bunch of dictionaries, one for every key type!

@eiriktsarpalis
Copy link
Member

The original authors for the HasSet API adding TryGetValue did not make a mistake. It was a thoughtful decision to implement a collection that supported developer-designed complex lookups on a collection without data duplication/desync (Dictionary KVP) without having O(n) lookup time.

Agree to disagree, I suppose. Even if we did add ConcurrentSet in the future, I doubt we would be adding a TryGetValue method there.

or conversely use ConcurrentDictionary<T,T> with developer-made boilerplate and have double the memory footprint for value keys?

I'm not convinced that duplicating the key is going to be the dominant size factor in the context of concurrent collections. Happy to be proven wrong if you have profiling data to share.

Implementing a custom key is a recipe for data desync between the key and the value stored, among other things.

Can you clarify what you mean by this? Are you suggesting that mutations in the value can change its equality?

@MapleWheels
Copy link
Author

MapleWheels commented Dec 19, 2024

Agree to disagree, I suppose. Even if we did add ConcurrentSet in the future, I doubt we would be adding a TryGetValue method there.

The whole purpose of this, is as stated in the remarks on MSDN, and that is literally the use-case. How can you say that it's a 'mistake' when .NET does not provide an alternative API for this? The use case here is one that's a billion-dollar industry: databases. The entire purpose of this is to support a "Queryable Dataset" with O(1) lookup time,; with the constraint of requiring a unique key in your query. Again, I asked you if there's an alternative and you ignored by question.

Implementing a custom key is a recipe for data desync between the key and the value stored, among other things.

Can you clarify what you mean by this? Are you suggesting that mutations in the value can change its equality?

or conversely use ConcurrentDictionary<T,T> with developer-made boilerplate and have double the memory footprint for value keys?

I'm not convinced that duplicating the key is going to be the dominant size factor in the context of concurrent collections. Happy to be proven wrong if you have profiling data to share.

So, the first issue here is that so long as the key and value are stored as separate entities. It is possible, again without boilerplate, to have a different struct in the key and value in the dictionary. This creates issues, such as "ghost references/GC handles".

For example, let's say we want to store some data for a resource, with the struct having the ability to store a reference to that resource when instantiated. It is entirely possible for the key to be added to the Dictionary at a time when the reference is alive, and the struct added as a key holds a GC handle, as well as the value.

I undserstand that this is not "best practices" but that doesn't mean that this isn't a real and subtle pitfall.

// note: the struct is immutable
var resource = new ResourceStruct
{
    Name = "name",
    Location = "/some/file/location.bin",
    Instance = SomeResourceFactory()
};
_myResourceCache[resource] = resource; // this resource key has a reference to the object

We decide it's time to unload resources, but we keep the metadata struct alive, just without the object reference:

//key implicit conversion and type implements IEqualityComparer<ResourceStruct>
_myResourceCache[cacheRef] = new ResourceStruct 
{
    Name = "name",
    Location = "/some/file/location.bin",
    Instance = null    // This doesn't matter because the 'key' object in the Dictionary still has a strong handle!
};

This is a pitfall. You need to clear the entry from the Dictionary first and then re-add it or the key object will hold a handle in itself. I am aware that I could store the reference elsewhere, but this example is just an illustration.

And before you say "Just make a custom key struct", I already have one, the data type's Interface implements IEqualityComparer<T> and I can use it as the key, as well as the value, and the equality is already defined in the type. That would be forcing devs to create redundant other key types and all of the boilerplate glue code, just so that we don't have someone accidentally do the above and then wonder why they are having issues, such as in this example case the resource not unloading.

And this goes outside of unloading obviously. Any mutable collection, even if the entries themselves are immutable, are going to have this problem so long as they can be replaced/updated. You can only update the value in a Dictionary, never the key, because the Dictionary was designed around immutable keys.

====
Going further, in terms of memory, we are working on a reworked API for third-party content, the result is that this resource collection is going to regularly be 1000s of entries at a minimum (typical use case right now is over 100 content packages, each having 10s of resources or more). Will this be a lot, maybe, maybe not, but that's not the point.

You already have HashSet<T>, why is there so much friction to adding a concurrent HashSet when you already have a concurrent version of everything else?

This is hacky maybe, but we use the IEqualityComparer as a way to specify returns from a Query, based on what values are null in the search key-instance struct provided, it is always unique, but there are multiple unique keys stored in the struct and we have a custom hashcode implementation for this.

===
Let's change the conversation a bit. Is there a new API structure that you would recommend to mimic an in-memory queryable database with O(1) lookup time built around IEqualityComparer<T>, IEquatable<T>? The constraint here is that this query uses unique keys, so you always only have one value for a given key.

@KalleOlaviNiemitalo
Copy link

I don't have an opinion on whether ConcurrentSet<T> should be added to .NET. But if you need it in your application now, I think you should try copying the ConcurrentDictionary<TKey, TValue> source code and removing TValue. You'd have to maintain the result yourself but you wouldn't need to wait for Microsoft decisions and new versions of .NET.

ConcurrentDictionary<TKey, TValue> does use some internal parts of .NET Runtime. It uses NonRandomizedStringEqualityComparer; but you wrote that you'd use ConcurrentSet<T> with T being a struct rather than System.String, so you can strip that out too.

ConcurrentDictionary<TKey, TValue> has some code using ConcurrentDictionaryTypeProps<TValue>.IsWriteAtomic. If you're going to update keys in ConcurrentSet<T> then you'll likely need a similar check for the key type.

@eiriktsarpalis
Copy link
Member

You already have HashSet<T>, why is there so much friction to adding a concurrent HashSet when you already have a concurrent version of everything else?

It's a matter of cost vs. benefit. Implementing collections is expensive, and our resources are finite. It's not every year that we allocate engineering time to work on a new collection type, but if we do we want to make sure that the new functionality has proportional impact. Besides that point, it is not the goal of System.Collections to offer every data structure under the sun. There is a lowest common denominator that we cater to, and for anything more bespoke one should consider consuming or publishing a NuGet package.

@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Dec 19, 2024
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Dec 19, 2024
@Clockwork-Muse
Copy link
Contributor

Note too that especially in the concurrent collection space, most of the functionality is also covered by implementations of memory caching libraries or databases, which may be more relevant solutions to the actual domain-specific problem being solved in a lot of these cases.

@MapleWheels
Copy link
Author

MapleWheels commented Dec 19, 2024

Note too that especially in the concurrent collection space, most of the functionality is also covered by implementations of memory caching libraries or databases, which may be more relevant solutions to the actual domain-specific problem being solved in a lot of these cases.

Edit: I am aware of Redis, and some others.

This is a fair point. My use case is that I have a giant collection of resources that I need to Query based on unique keys and it needs to be updated based on both disk-cached information and in-memory compiled assemblies + loaded binary assemblies.

I essentially need an in-memory database with high performance read/write but the keys are simple and unique and a HashSet<T> fits the bill just right for lookups. Anything heavier usually results in it being better to just write boilerplate around a ConcurrentDictionary<T,T>.

I want to avoid having to write and maintain a collection of HashSets with my own threading controls.

I know it would take months to implement but I've come across this whole "Lightweight, Concurrent, In-Memory Database" issue several times over the years and it's bugged me that a package doesn't exist that does the job as I've described.

@Tornhoof
Copy link
Contributor

You might want to look at MS faster or MS Garnet for your usecase. Garnet is basically the successor of Faster.
https://github.com/microsoft/FASTER
https://github.com/microsoft/garnet

@CyrusNajmabadi
Copy link
Member

Could your use case be satisfied with ConcurrentDictionary<T, T>, using dict[obj] = obj instead of AddOrReplace

Semantically, yes if this was just a HashSet underneath that implemented IDictionary<T,T> with a bonded Key-Value object (the value assignment also updates the key) but in implementation/as a substitute to ConcurrentSet, no because this results in a duplication of the structs. The example I used is simple but in more complex cases this would add a lot of memory overhead and potentially desync between the key and stored value without boilerplate code.

What about ConcurrentDict<T,bool>

@MapleWheels
Copy link
Author

MapleWheels commented Dec 22, 2024

Could your use case be satisfied with ConcurrentDictionary<T, T>, using dict[obj] = obj instead of AddOrReplace

Semantically, yes if this was just a HashSet underneath that implemented IDictionary<T,T> with a bonded Key-Value object (the value assignment also updates the key) but in implementation/as a substitute to ConcurrentSet, no because this results in a duplication of the structs. The example I used is simple but in more complex cases this would add a lot of memory overhead and potentially desync between the key and stored value without boilerplate code.

What about ConcurrentDict<T,bool>

See this comment here: #110825 (comment)

The fundamental problem is that the dictionaries do not give you access to the key object without doing an O(n) search on the Keys component of the dictionary. I am using structs, not classes, so I can't just cheat by doing Dictionary<object, object>.

@CyrusNajmabadi
Copy link
Member

The fundamental problem is that the dictionaries do not give you access to the key object

I don't know what you mean by "the key object"

do not give you access to the key object without doing an O(n) search on the Keys component of the dictionary. I

Why would it be O(n)?

I am using structs, not classes,

I don't know why that would matter.

--
Fundamentally, i don't see why a proposed ConcurrentSet<T> would solve this, if ConcurrenctDict<T,VoidResult>` wouldn't also solve it. If the latter can't solve it, how would the former?

@MapleWheels
Copy link
Author

MapleWheels commented Dec 23, 2024

The fundamental problem is that the dictionaries do not give you access to the key object

I don't know what you mean by "the key object"

The object/entity stored as the key in the kvpair?

do not give you access to the key object without doing an O(n) search on the Keys component of the dictionary. I

Why would it be O(n)?

Because you cannot retrieve the key using the hash lookup. You have to query/loop over the keys list.

I am using structs, not classes,

I don't know why that would matter.

Because structs are value types, meaning I can't cheat by doing Dict<T,T> and make the key and value the same reference pointer without wrapping it.

Edit: Also, since the type is immutable, even if they were reference types, any upserting would cause the key and value objects to be different instances.

-- Fundamentally, i don't see why a proposed ConcurrentSet<T> would solve this, if ConcurrenctDict<T,VoidResult>` wouldn't also solve it. If the latter can't solve it, how would the former?

See my previous post.

@CyrusNajmabadi
Copy link
Member

The object/entity stored as the key in the kvpair?

WDYM then that they do not give you access to it?

Because you cannot retrieve the key using the hash lookup. You have to query/loop over the keys list.

I don't understand what you mean by 'retrieve the key'. You have the key. You use it to lookup a value (in the dict case) or test for containment (in the set case). Why would you need to 'retrieve the key'. Furthermore, how would a ConcurrentSet help here? How does a concurrent set hel you 'retrieve the key'.

Because structs are value types, meaning I can't cheat by doing Dict<T,T> and make the key and value the same reference pointer without wrapping it.

I don't knkow what this means.

See my previous post.

i hve. it's still not making sense to me. I'm also not seeing how a hypothetical concurrent set makes this possible. A concurrent set it just a set that can be used safely/concurrently from multiple threads. It would otherwise be the same as any other set type in the bcl (like HashSet).

@MapleWheels
Copy link
Author

MapleWheels commented Dec 23, 2024

See my previous post.

i hve. it's still not making sense to me. I'm also not seeing how a hypothetical concurrent set makes this possible. A concurrent set it just a set that can be used safely/concurrently from multiple threads. It would otherwise be the same as any other set type in the bcl (like HashSet).

You clearly have not? The key is one or more members of the data object using a custom IEqualityComparer. The key put into the dictionary is implicitly converted into a new instance with only the members used as keys initialized for equality logic.

This allows me to find the other data object of equality in the dictionary, which has actual data on it. The problem is the real data object is being stored as a key, not a value, since it's used as a key.

HashSet solves this implementating TryGetValue().

This was all in my post above. #110825 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Projects
None yet
Development

No branches or pull requests

8 participants