Performance enhancement: freeze the internal LittleDicts #56

Open
ablaom opened this issue Apr 11, 2023 · 2 comments

ablaom commented Apr 11, 2023

OrderedCollections.jl has a freeze command to make a LittleDict immutable but faster.

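For reference, a minimal sketch of what freeze does (the types shown in the comments are indicative only):

import OrderedCollections: LittleDict, freeze

d = LittleDict("yes" => 0.3, "no" => 0.7)
typeof(d)    # LittleDict{String, Float64, Vector{String}, Vector{Float64}}

fd = freeze(d)          # immutable copy backed by tuples instead of vectors
typeof(fd)   # LittleDict{String, Float64, Tuple{String, String}, Tuple{Float64, Float64}}
fd["yes"]    # 0.3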
@OkonSamuel

ablaom added the good first issue label Apr 11, 2023

Jay-sanjay commented Oct 24, 2023

Hello @ablaom, I would like to work on this issue.
I see you want the freeze function defined here
https://github.com/JuliaCollections/OrderedCollections.jl/blob/c3687a87d7ce0d134bf3578bb85472c16e903012/src/little_dict.jl#L101-L107 from OrderedCollections.jl applied in this repo, somewhere around this field:

prob_given_ref::LittleDict{R,P,Vector{R}, Vector{P}}

Can you help me with what I should do next? I would be more than happy to contribute!


ablaom commented Oct 24, 2023

Thanks @Jay-sanjay for the offer of help. Please understand that there's no guarantee that this will substantially speed things up, but I thought it might be interesting to investigate.

The first thing is to establish some benchmarks, so we can make sure we don't make anything important significantly worse. I've attached some suggested benchmarks at the end. Have a look at those, get a little familiar with what is being benchmarked, and run the benchmarks locally on your machine.
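To compare results before and after the change, something like the following could be used (a sketch only; trial_before and trial_after are hypothetical names for results of bench saved from runs on the current code and on the modified code):

import BenchmarkTools as B

# `trial_before` and `trial_after` stand for two saved BenchmarkTools.Trial
# objects from running the same benchmark on each version of the code:
judgement = B.judge(B.median(trial_after), B.median(trial_before))
# `judgement` indicates whether the change is an improvement, a regression,
# or within noise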

Adding a new type parameter C

The typing you mention,

prob_given_ref::LittleDict{R,P,Vector{R}, Vector{P}} 

will need to change to

prob_given_ref::LittleDict{R,P,NTuple{C,R}, NTuple{C,P}} 

where C is the number of classes in the support of the distribution (which can be smaller than the size of the full categorical pool). This means we need to add C as a type parameter:

struct UnivariateFinite{C,S,V,R,P}    #  <---------------------- was {S,V,R,P}
    scitype::Type{S}
    decoder::CategoricalDecoder{V,R}
    prob_given_ref::LittleDict{R,P,NTuple{C,R}, NTuple{C,P}}
end

and make a similar change to the UnivariateFiniteArray struct.

Everywhere the type parameters appear in the code will need to be updated to account for the fact that we've added C. This could be a little tedious, but it is most of the work.
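For instance (a hypothetical method, not one taken from this repository, just to illustrate the pattern):

# before: dispatch on the old parameter list
nclasses(d::UnivariateFinite{S,V,R,P}) where {S,V,R,P} = length(d.prob_given_ref)

# after: the new leading parameter C is threaded through
nclasses(d::UnivariateFinite{C,S,V,R,P}) where {C,S,V,R,P} = C

# (`nclasses` is a made-up name used for illustration only)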

Updating the constructors

All the constructors call a single dict-based constructor, and in that constructor we can just replace LittleDict(pairs...) with freeze(LittleDict(pairs...)). This occurs in two places here:

return UnivariateFiniteArray(S, parent_decoder, LittleDict(pairs...))
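which, after the change, would read (a sketch; the second call site is analogous):

return UnivariateFiniteArray(S, parent_decoder, freeze(LittleDict(pairs...)))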

We could later try some further optimisations, constructing the tuple-based dicts directly from the keys and values of the supplied dictionary, but let's leave that for now.

Maybe some unit tests will need fixing, but hopefully that's it.

Benchmarking code
using CategoricalArrays
using CategoricalDistributions
import Random
import BenchmarkTools as B
import Distributions as D
using Pkg
using InteractiveUtils

# for benchmark reproducibility:
Random.seed!(123)

# number of samples to take in benchmarks; increase to lower σ in benchmarks, decrease if
# impatient:
S = 100

# size of data for fitting:
N = 10000

# size of multi-samples and `UnivariateFiniteVector`s
K = 10_000


pool = collect("abcde") |> categorical;
data = rand(pool, N);

bench(b) = B.run(b, samples=S, seconds=60)

println()
versioninfo()
println()
Pkg.status()
println()

b = B.@benchmarkable D.fit(UnivariateFinite, $data)
@info "`fit`" bench(b)

true_probs = rand(K);
b = B.@benchmarkable UnivariateFinite($true_probs, augment=true, pool=missing)
@info "constructors - binary case, missing pool" bench(b)

binary_pool = [false, true] |> categorical
b = B.@benchmarkable UnivariateFinite($binary_pool, $true_probs, augment=true)
@info "constructors - binary case, labels specified" bench(b)

probs = rand(K, 5);
probs = probs ./ sum(probs, dims=2);
b = B.@benchmarkable UnivariateFinite($probs, pool=missing)
@info "constructors - multiclass case, missing pool" bench(b)

b = B.@benchmarkable UnivariateFinite($pool, $probs)
@info "constructors - binary case, labels specified" bench(b)

d = UnivariateFinite(pool, fill(0.2, 5));
b = B.@benchmarkable rand($d)
@info "`rand` - single sample" bench(b)

b = B.@benchmarkable rand($d, K)
@info "`rand` - multiple samples" bench(b)

x = 'a'
b = B.@benchmarkable pdf($d, $x)
@info "`pdf` - single distribution" bench(b)

d_vector = UnivariateFinite(pool, probs);
b = B.@benchmarkable pdf.($d_vector, $x)
@info "`pdf` - multi-distribution" bench(b)

b = B.@benchmarkable mode($d)
@info "`mode` - single distribution" bench(b)

b = B.@benchmarkable mode.($d_vector)
@info "`mode` - multi-distribution" bench(b)
