
Add AbstractUnivariateFinite for performance customisation #70

Open · rafaqz opened this issue Nov 29, 2023 · 8 comments
@rafaqz commented Nov 29, 2023

UnivariateFinite can be overkill for e.g. simple Bool categories.

It looks like they could be at least 50x faster to construct using a custom struct without the LittleDict.

Adding an AbstractUnivariateFinite would allow defining objects that specialise on type.

Currently, the profile using Shapley.jl with Bool categories looks like this:

[profile screenshot]
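Roughly what I have in mind, as a sketch only (UnivariateBool and everything below is hypothetical, not a reflection of the package internals):

import Distributions  # the real UnivariateFinite extends Distributions.pdf

# Sketch: a shared abstract supertype plus a Bool-specialised distribution that
# stores just the probability of `true`, so no LittleDict is involved.
abstract type AbstractUnivariateFinite end

struct UnivariateBool{P<:Real} <: AbstractUnivariateFinite
    p_true::P  # probability assigned to `true`
end

# lookup is a branch rather than a dictionary access
Distributions.pdf(d::UnivariateBool, x::Bool) = x ? d.p_true : one(d.p_true) - d.p_true

# the `+` that Shapley.jl calls repeatedly becomes plain float arithmetic
Base.:+(a::UnivariateBool, b::UnivariateBool) = UnivariateBool(a.p_true + b.p_true)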

@ablaom (Member) commented Nov 29, 2023

I can't think of any reason not to add an abstract type. The Boolean case would certainly be a common use case, but I probably don't have the bandwidth to actually add specialisations, as you suggest.

Just curious, in what context are you making use of the package?

@rafaqz (Author) commented Nov 30, 2023

Thanks, that's reasonable!

The context is helping @tiemvanderdeure write SpeciesDistributionModels.jl on top of MLJ.jl. SDMs predict presence or absence of species, so all boolean classifiers.

MLJ has been amazing for interop and standardization so far. UnivariateFinite has been one of the few parts that is slow, and it's also a little awkward to use.

(I was thinking about having a static version with the keys in the types; it would be very fast in some contexts, even for non-Bool categories.)

@ablaom (Member) commented Nov 30, 2023

UnivariateFinite has been one of few parts that is slow

Are you possibly constructing arrays of distributions by individually constructing each distribution, rather than calling the UnivariateFinite constructor with a probability array?

julia> using CategoricalDistributions, BenchmarkTools  # for UnivariateFinite and @btime

julia> probs = rand(10000);

julia> @btime UnivariateFinite($probs, augment=true, pool=missing);
  97.113 μs (241 allocations: 415.92 KiB)

julia> @btime [UnivariateFinite([probs[i],], augment=true, pool=missing) for i in 1:10000];
  278.524 ms (2369497 allocations: 142.36 MiB)

@tiemvanderdeure (Contributor) commented

In this case Shapley.shapley is calling + on UnivariateFinites lots of times, which ends up being slower than necessary.

As a result, calling pdf.(x, true) on the UnivariateFinite before + gets called ends up being much faster. I don't think it's constructing each distribution separately, but I might be wrong?

import MLJ, CategoricalArrays, Shapley
RFC = MLJ.@load RandomForestClassifier pkg=DecisionTree
x = (a = rand(100), )
y = CategoricalArrays.categorical(x.a .> rand(100))

mach = MLJ.machine(RFC(), x, y)
MLJ.fit!(mach)

# predictions kept as UnivariateFinite, so Shapley aggregates by calling `+` on the distributions
shap_wo_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.predict(mach, x), Shapley.MonteCarlo(100), x);
# probabilities for `true` extracted up front, so Shapley only adds plain floats
shap_w_pdf(mach = mach, x = x) = Shapley.shapley(x -> MLJ.pdf.(MLJ.predict(mach, x), true), Shapley.MonteCarlo(100), x);
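For reference, the two can be compared along these lines (just a sketch, assuming BenchmarkTools is available; timings depend on the machine):

using BenchmarkTools

@btime shap_wo_pdf()  # `+` is called on UnivariateFinite objects inside Shapley
@btime shap_w_pdf()   # `+` is called on plain Float64s after pdf.(_, true)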

@ablaom (Member) commented Nov 30, 2023

Okay, so my comment does not apply.

Since pdf.(::UnivariateFiniteArray, ...) and +(::UnivariateFiniteArray, ::UnivariateFiniteArray) are optimised (the use of dictionaries notwithstanding), I don't see what else you could be doing better, then.

@rafaqz (Author) commented Nov 30, 2023

I feel like optimizations on UnivariateFiniteArray are often going to be bypassed like this, and UnivariateFinite is quite heavy on its own, outside of the array wrapper.

Is there a reason you store the category labels as runtime values, instead of putting them in the type like a NamedTuple? Then there would be less need for optimizations over the whole array.
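To make that concrete, something in this direction (all names here are hypothetical, and lookup is a small loop over the type-level labels rather than a dictionary access):

# Sketch: class labels live in the type as a tuple of Symbols (NamedTuple-style),
# probabilities in an NTuple field, so no per-object dictionary is stored.
struct StaticUnivariateFinite{labels,N,P<:Real}
    probs::NTuple{N,P}
end

StaticUnivariateFinite{labels}(probs::NTuple{N,P}) where {labels,N,P} =
    StaticUnivariateFinite{labels,N,P}(probs)

# probability lookup scans the (usually tiny) tuple of labels
function prob(d::StaticUnivariateFinite{labels}, c::Symbol) where {labels}
    for (i, l) in enumerate(labels)
        l === c && return d.probs[i]
    end
    return zero(eltype(d.probs))
end

d = StaticUnivariateFinite{(:absent, :present)}((0.3, 0.7))
prob(d, :present)  # 0.7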

@ablaom (Member) commented Dec 3, 2023

I think we had in mind the case of large class cardinality, which admittedly is not a big use case in the ML applications. But I'm not convinced that adding these as types would preclude the need for the wrapper for arrays, to get speed (which was terrible before the wrapper was added). Maybe, if you are also dumping the dictionaries, but then pdf lookup is going to be slower, right?

@rafaqz (Author) commented Dec 4, 2023

Ah right, yes, the NamedTuple strategy only scales to ~50 classes. It's the classic DataFrames.jl/TypedTables.jl split. I guess just optimising Bool at least covers the very bottom end of the spectrum.

Sorry I don't know enough details about how pdf lookup is optimised to understand the benefits.
