Add AbstractUnivariateFinite for performance customisation #70
Comments
I can't think of any reason not to add an abstract type. The Boolean case would be a common use case, for sure, but I probably don't have the bandwidth to actually add specialisations, as you suggest. Just curious, in what context are you making use of the package?
Thanks, that's reasonable! The context is helping @tiemvanderdeure write SpeciesDistributionModels.jl on top of MLJ.jl. SDMs predict the presence or absence of species, so they are all boolean classifiers. MLJ has been amazing for interop and standardization so far. (I was thinking about having a static version with keys in the types; it would be very fast in some contexts, even for non-Bool categories.)
Are you possibly constructing arrays of distributions by individually constructing each distribution, rather than calling the constructor once on the whole probability vector?
julia> using BenchmarkTools, CategoricalDistributions
julia> probs = rand(10000);
julia> @btime UnivariateFinite($probs, augment=true, pool=missing);
97.113 μs (241 allocations: 415.92 KiB)
julia> @btime [UnivariateFinite([probs[i],], augment=true, pool=missing) for i in 1:10000];
278.524 ms (2369497 allocations: 142.36 MiB)
In this case Which results in that calling
import MLJ, CategoricalArrays, Shapley
RFC = MLJ.@load RandomForestClassifier pkg=DecisionTree
x = (a = rand(100), )
y = CategoricalArrays.categorical(x.a .> rand(100))
mach = MLJ.machine(RFC(), x, y)
MLJ.fit!(mach)
shap_wo_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.predict(mach, x), Shapley.MonteCarlo(100), x);
shap_w_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.pdf.(MLJ.predict(mach, x), true), Shapley.MonteCarlo(100), x);
Okay, so my comment does not apply. Since
I feel like optimizations on Is there a reason you store the category labels as runtime values, instead of putting them in the type like a
I think we had in mind the case of large class cardinality, which admittedly is not a big use case in the ML applications. But I'm not convinced that adding these as types would preclude the need for the wrapper for arrays, to get speed (which was terrible before the wrapper was added). Maybe, if you are also dumping the dictionaries, but then
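To make the "wrapper for arrays" point concrete, here is a minimal sketch of the idea in plain Julia. The names `BoolDist`, `BoolDistVector` and `prob_vector` are hypothetical and are not part of CategoricalDistributions.jl; the sketch only assumes that a vector of two-class distributions can be backed by a single vector of probabilities, with elements materialised on demand.

```julia
# Hypothetical element type: a two-class distribution stored as one Float64.
struct BoolDist
    p_true::Float64
end

# Hypothetical array wrapper: one shared probability vector instead of
# one heap-allocated object per distribution.
struct BoolDistVector <: AbstractVector{BoolDist}
    p_true::Vector{Float64}
end

# Minimal AbstractVector interface; elements are built lazily on indexing.
Base.size(v::BoolDistVector) = size(v.p_true)
Base.IndexStyle(::Type{BoolDistVector}) = IndexLinear()
Base.getindex(v::BoolDistVector, i::Int) = BoolDist(v.p_true[i])

# Vectorised lookup that never constructs the individual elements.
prob_vector(v::BoolDistVector, x::Bool) = x ? copy(v.p_true) : 1 .- v.p_true
```

With this layout, constructing the array costs one allocation regardless of length, and a whole-array probability lookup is a single broadcast. Whether this matches the internals of the package's own array wrapper is not something the sketch claims.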
Ah right, yes the NamedTuple strategy only scales to ~50 classes. It's the classic DataFrames.jl/TypedTables.jl split. I guess just optimising Sorry I don't know enough details about how
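For readers unfamiliar with the "labels in the type" approach discussed here, the following is a rough sketch of what it could look like. `StaticFinite`, `class_labels` and `prob` are hypothetical names chosen for illustration, not an existing API. The labels live in a type parameter, so an instance holds only a tuple of floats.

```julia
# Hypothetical: class labels are a type parameter, probabilities a plain tuple.
struct StaticFinite{labels,N}
    probs::NTuple{N,Float64}
end

# Convenience constructor from a NamedTuple such as (absent = 0.8, present = 0.2).
function StaticFinite(nt::NamedTuple{labels}) where {labels}
    probs = map(Float64, values(nt))
    return StaticFinite{labels,length(labels)}(probs)
end

# The labels are recovered from the type, not from a runtime field.
class_labels(::StaticFinite{labels}) where {labels} = labels

# Probability of one label; labels outside the support get probability zero.
function prob(d::StaticFinite{labels}, label) where {labels}
    i = findfirst(==(label), labels)
    return i === nothing ? 0.0 : d.probs[i]
end
```

The trade-off is the one raised in the thread: every distinct label set is a distinct type, so this compiles well for a handful of classes but puts pressure on the compiler as the number of classes grows.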
UnivariateFinite can be overkill for e.g. simple Bool categories. It looks like they could be at least 50x faster to construct using a custom struct without the LittleDict.
Adding an AbstractUnivariateFinite would allow defining objects that specialised on type.
Currently, the profile using Shapley.jl with Bool categories looks like this:
[profile image]
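As an illustration of the proposal, here is a minimal, self-contained sketch of how an abstract supertype could let a dictionary-free Bool specialisation coexist with a general dictionary-backed type. All names (`AbstractFiniteSketch`, `BoolFinite`, `DictFinite`, `classes`, `prob`, `most_likely`) are hypothetical stand-ins; in particular `DictFinite` uses a plain `Dict` rather than the package's `LittleDict`, and nothing here reflects the package's actual fields.

```julia
# Hypothetical abstract supertype that generic code can dispatch on.
abstract type AbstractFiniteSketch{T} end

# Specialised two-class case: a single Float64, no dictionary.
struct BoolFinite <: AbstractFiniteSketch{Bool}
    p_true::Float64
    function BoolFinite(p::Real)
        0 <= p <= 1 || throw(ArgumentError("probability must lie in [0, 1]"))
        return new(Float64(p))
    end
end

# General case: probabilities looked up in a dictionary keyed by class.
struct DictFinite{T} <: AbstractFiniteSketch{T}
    prob_given_class::Dict{T,Float64}
end

classes(::BoolFinite) = (false, true)
classes(d::DictFinite) = collect(keys(d.prob_given_class))

prob(d::BoolFinite, x::Bool) = x ? d.p_true : 1 - d.p_true
prob(d::DictFinite{T}, x::T) where {T} = get(d.prob_given_class, x, 0.0)

# Generic code written once against the abstract type works for both.
function most_likely(d::AbstractFiniteSketch)
    best = first(classes(d))
    for c in classes(d)
        prob(d, c) > prob(d, best) && (best = c)
    end
    return best
end
```

Downstream packages could then add their own subtypes without touching the generic methods, which is the customisation point the issue asks for.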