Performance enhancement: freeze the internal LittleDicts #56

Open
ablaom opened this issue Apr 11, 2023 · 2 comments

ablaom commented Apr 11, 2023

OrderedCollections.jl has a freeze command to make a LittleDict immutable but faster.

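For reference, a minimal sketch of what freeze does (the types shown in the comments are indicative only):

import OrderedCollections: LittleDict, freeze

d = LittleDict("yes" => 0.3, "no" => 0.7)
typeof(d)    # LittleDict{String, Float64, Vector{String}, Vector{Float64}}

fd = freeze(d)          # immutable copy backed by tuples instead of vectors
typeof(fd)   # LittleDict{String, Float64, Tuple{String, String}, Tuple{Float64, Float64}}
fd["yes"]    # 0.3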
@OkonSamuel

ablaom added the good first issue label Apr 11, 2023

Jay-sanjay commented Oct 24, 2023

Hello @ablaom, I would like to work on this issue.
I see you want the freeze function defined here
https://github.com/JuliaCollections/OrderedCollections.jl/blob/c3687a87d7ce0d134bf3578bb85472c16e903012/src/little_dict.jl#L101-L107 from OrderedCollections.jl applied in this repo, somewhere around this field:

prob_given_ref::LittleDict{R,P,Vector{R}, Vector{P}}

Can you help me with what I should do next? I would be more than happy to contribute!


ablaom commented Oct 24, 2023

Thanks @Jay-sanjay for the offer of help. Please understand that there's no guarantee that this will substantially speed things up, but I thought it might be interesting to investigate.

The first thing is to establish some benchmarks, so we can make sure we don't make anything important significantly worse. I've attached some suggested benchmarks at the end. Have a look at those, get a little familiar with what is being benchmarked, and run the benchmarks locally on your machine.
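To compare results before and after the change, something like the following could be used (a sketch only; trial_before and trial_after are hypothetical names for results of bench saved from runs on the current code and on the modified code):

import BenchmarkTools as B

# `trial_before` and `trial_after` stand for two saved BenchmarkTools.Trial
# objects from running the same benchmark on each version of the code:
judgement = B.judge(B.median(trial_after), B.median(trial_before))
# `judgement` indicates whether the change is an improvement, a regression,
# or within noise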

Adding a new type parameter C

The typing you mention,

prob_given_ref::LittleDict{R,P,Vector{R}, Vector{P}} 

will need to change to

prob_given_ref::LittleDict{R,P,NTuple{C,R}, NTuple{C,P}} 

where C is the number of classes in the support of the distribution (which can be smaller than the size of the full categorical pool). This means we need to add C as a type parameter:

struct UnivariateFinite{C,S,V,R,P}    #  <---------------------- was {S,V,R,P}
    scitype::Type{S}
    decoder::CategoricalDecoder{V,R}
    prob_given_ref::LittleDict{R,P,NTuple{C,R}, NTuple{C,P}}
end

and make a similar change to the UnivariateFiniteArray struct.

Everywhere the type parameters appear in the code will need to be updated to account for the fact that we've added C. This could be a little tedious, but it is most of the work.
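For instance (a hypothetical method, not one taken from this repository, just to illustrate the pattern):

# before: dispatch on the old parameter list
nclasses(d::UnivariateFinite{S,V,R,P}) where {S,V,R,P} = length(d.prob_given_ref)

# after: the new leading parameter C is threaded through
nclasses(d::UnivariateFinite{C,S,V,R,P}) where {C,S,V,R,P} = C

# (`nclasses` is a made-up name used for illustration only)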

Updating the constructors

All the constructors call a single dict-based constructor, and in that constructor we can just replace LittleDict(pairs...) with freeze(LittleDict(pairs...)). This occurs in two places here:

return UnivariateFiniteArray(S, parent_decoder, LittleDict(pairs...))
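which, after the change, would read (a sketch; the second call site is analogous):

return UnivariateFiniteArray(S, parent_decoder, freeze(LittleDict(pairs...)))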

We could later try some further optimisations, constructing the tuple-based dicts directly from the keys and values of the supplied dictionary, but let's leave that for now.

Maybe some unit tests will need fixing, but hopefully that's it.

Benchmarking code
using CategoricalArrays
using CategoricalDistributions
import Random
import BenchmarkTools as B
import Distributions as D
using Pkg
using InteractiveUtils

# for benchmark reproducibility:
Random.seed!(123)

# number of samples to take in benchmarks; increase to lower σ in benchmarks, decrease if
# impatient:
S = 100

# size of data for fitting:
N = 10000

# size of multi-samples and `UnivariateFiniteVector`s
K = 10_000


pool = collect("abcde") |> categorical;
data = rand(pool, N);

bench(b) = B.run(b, samples=S, seconds=60)

println()
versioninfo()
println()
Pkg.status()
println()

b = B.@benchmarkable D.fit(UnivariateFinite, $data)
@info "`fit`" bench(b)

true_probs = rand(K);
b = B.@benchmarkable UnivariateFinite($true_probs, augment=true, pool=missing)
@info "constructors - binary case, missing pool" bench(b)

binary_pool = [false, true] |> categorical
b = B.@benchmarkable UnivariateFinite($binary_pool, $true_probs, augment=true)
@info "constructors - binary case, labels specified" bench(b)

probs = rand(K, 5);
probs = probs ./ sum(probs, dims=2);
b = B.@benchmarkable UnivariateFinite($probs, pool=missing)
@info "constructors - multiclass case, missing pool" bench(b)

b = B.@benchmarkable UnivariateFinite($pool, $probs)
@info "constructors - binary case, labels specified" bench(b)

d = UnivariateFinite(pool, fill(0.2, 5));
b = B.@benchmarkable rand($d)
@info "`rand` - single sample" bench(b)

b = B.@benchmarkable rand($d, K)
@info "`rand` - multiple samples" bench(b)

x = 'a'
b = B.@benchmarkable pdf($d, $x)
@info "`pdf` - single distribution" bench(b)

d_vector = UnivariateFinite(pool, probs);
b = B.@benchmarkable pdf.($d_vector, $x)
@info "`pdf` - multi-distribution" bench(b)

b = B.@benchmarkable mode($d)
@info "`mode` - single distribution" bench(b)

b = B.@benchmarkable mode.($d_vector)
@info "`mode` - multi-distribution" bench(b)
