
Add AbstractUnivariateFinite for performance customisation #70

Open · rafaqz opened this issue Nov 29, 2023 · 8 comments
@rafaqz commented Nov 29, 2023

UnivariateFinite can be overkill for e.g. simple Bool categories.

It looks like they could be at least 50x faster to construct using a custom struct without the LittleDict.

Adding an AbstractUnivariateFinite would allow defining objects that specialise on type.

Currently, the profile using Shapley.jl with Bool categories looks like this:

[profile screenshot]
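Roughly what I have in mind, as a sketch only (UnivariateBool and everything below is hypothetical, not a reflection of the package internals):

import Distributions  # the real UnivariateFinite extends Distributions.pdf

# Sketch: a shared abstract supertype plus a Bool-specialised distribution that
# stores just the probability of `true`, so no LittleDict is involved.
abstract type AbstractUnivariateFinite end

struct UnivariateBool{P<:Real} <: AbstractUnivariateFinite
    p_true::P  # probability assigned to `true`
end

# lookup is a branch rather than a dictionary access
Distributions.pdf(d::UnivariateBool, x::Bool) = x ? d.p_true : one(d.p_true) - d.p_true

# the `+` that Shapley.jl calls repeatedly becomes plain float arithmetic
Base.:+(a::UnivariateBool, b::UnivariateBool) = UnivariateBool(a.p_true + b.p_true)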

@ablaom (Member) commented Nov 29, 2023

I can't think of any reason not to add an abstract type. The Boolean case would certainly be a common use case, but I probably don't have the bandwidth to actually add specialisations, as you suggest.

Just curious, in what context are you making use of the package?

@rafaqz (Author) commented Nov 30, 2023

Thanks, that's reasonable!

The context is helping @tiemvanderdeure write SpeciesDistributionModels.jl on top of MLJ.jl. SDMs predict presence or absence of species, so all boolean classifiers.

MLJ has been amazing for interop and standardization so far. UnivariateFinite has been one of the few parts that is slow, and it's also a little awkward to use.

(I was thinking about having a static version with the keys in the types; it would be very fast in some contexts, even for non-Bool categories.)

@ablaom (Member) commented Nov 30, 2023

UnivariateFinite has been one of few parts that is slow

Are you possibly constructing arrays of distributions by individually constructing each distribution, rather than calling the UnivariateFinite constructor with a probability array?

julia> using CategoricalDistributions, BenchmarkTools  # for UnivariateFinite and @btime

julia> probs = rand(10000);

julia> @btime UnivariateFinite($probs, augment=true, pool=missing);
  97.113 μs (241 allocations: 415.92 KiB)

julia> @btime [UnivariateFinite([probs[i],], augment=true, pool=missing) for i in 1:10000];
  278.524 ms (2369497 allocations: 142.36 MiB)

@tiemvanderdeure (Contributor) commented

In this case Shapley.shapley is calling + on UnivariateFinites lots of times, which ends up being slower than necessary.

As a result, calling pdf.(x, true) on the UnivariateFinite before + gets called ends up being much faster. I don't think it's constructing each distribution separately, but I might be wrong?

import MLJ, CategoricalArrays, Shapley
RFC = MLJ.@load RandomForestClassifier pkg=DecisionTree
x = (a = rand(100), )
y = CategoricalArrays.categorical(x.a .> rand(100))

mach = MLJ.machine(RFC(), x, y)
MLJ.fit!(mach)

# predictions kept as UnivariateFinite, so Shapley aggregates by calling `+` on the distributions
shap_wo_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.predict(mach, x), Shapley.MonteCarlo(100), x);
# probabilities for `true` extracted up front, so Shapley only adds plain floats
shap_w_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.pdf.(MLJ.predict(mach, x), true), Shapley.MonteCarlo(100), x);
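For reference, the two can be compared along these lines (just a sketch, assuming BenchmarkTools is available; timings depend on the machine):

using BenchmarkTools

@btime shap_wo_pdf()  # `+` is called on UnivariateFinite objects inside Shapley
@btime shap_w_pdf()   # `+` is called on plain Float64s after pdf.(_, true)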

@ablaom (Member) commented Nov 30, 2023

Okay, so my comment does not apply.

Since pdf.(::UnivariateFiniteArray, ...) and +(::UnivariateFiniteArray, ::UnivariateFiniteArray) are optimised (the use of dictionaries notwithstanding), I don't see what else you could be doing better, then.

@rafaqz (Author) commented Nov 30, 2023

I feel like optimizations on UnivariateFiniteArray are often going to be bypassed like this, and UnivariateFinite is quite heavy on its own, outside of the array wrapper.

Is there a reason you store the category labels as runtime values, instead of putting them in the type like a NamedTuple? Then there would be less need for optimizations over the whole array.
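To make that concrete, something in this direction (all names here are hypothetical, and lookup is a small loop over the type-level labels rather than a dictionary access):

# Sketch: class labels live in the type as a tuple of Symbols (NamedTuple-style),
# probabilities in an NTuple field, so no per-object dictionary is stored.
struct StaticUnivariateFinite{labels,N,P<:Real}
    probs::NTuple{N,P}
end

StaticUnivariateFinite{labels}(probs::NTuple{N,P}) where {labels,N,P} =
    StaticUnivariateFinite{labels,N,P}(probs)

# probability lookup scans the (usually tiny) tuple of labels
function prob(d::StaticUnivariateFinite{labels}, c::Symbol) where {labels}
    for (i, l) in enumerate(labels)
        l === c && return d.probs[i]
    end
    return zero(eltype(d.probs))
end

d = StaticUnivariateFinite{(:absent, :present)}((0.3, 0.7))
prob(d, :present)  # 0.7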

@ablaom (Member) commented Dec 3, 2023

I think we had in mind the case of large class cardinality, which admittedly is not a big use case in the ML applications. But I'm not convinced that adding these as types would preclude the need for the wrapper for arrays, to get speed (which was terrible before the wrapper was added). Maybe, if you are also dumping the dictionaries, but then pdf lookup is going to be slower, right?

@rafaqz (Author) commented Dec 4, 2023

Ah right, yes, the NamedTuple strategy only scales to ~50 classes. It's the classic DataFrames.jl/TypedTables.jl split. I guess just optimising Bool at least covers the very bottom end of the spectrum.

Sorry I don't know enough details about how pdf lookup is optimised to understand the benefits.
