Testing against Contreras Rodríguez et al. 2021 #11

Open
nluetts opened this issue Oct 11, 2024 · 2 comments

nluetts commented Oct 11, 2024

Hi @kellino 👋

this is Nils; I am opening this issue in response to the review over at JOSS
(openjournals/joss-reviews#7334).

From your JOSS paper draft I got the impression that the paper by Contreras
Rodríguez et al. (https://doi.org/10.3390/e23050561) is quite central to your
package, so I thought it would be a good idea to test your package against the
estimated entropies published there (this does not yet seem to be covered by
your test cases, although they are otherwise quite comprehensive).

In the paper, they estimate entropies for byte sequences generated on a Linux
machine with /dev/urandom, so I did the same and tried to reproduce their
Table 2 with DiscreteEntropy.jl.

I used this Julia environment:

[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.5 (2024-08-27)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(DiscreteEntropy-JOSS-Review) pkg> status
Status `/var/home/nluetts/Repos/DiscreteEntropy-JOSS-Review/Project.toml`
  [940648b6] DiscreteEntropy v0.2.0
  [08abe8d2] PrettyTables v2.4.0
  [10745b16] Statistics v1.10.0

And the following script to do the comparison:

using DiscreteEntropy
using Statistics
using PrettyTables

function main()
    results = Dict{String,Vector{Float64}}()
    for est_label in keys(REF_VALUES)
        println("###############################################################################")
        println()
        est = ESTIMATORS[est_label]
        if isnothing(est)
            @warn "No estimator for \"$(est_label)\", skipping ..."
            # so that results table is not missing a row from the paper
            results[est_label] = repeat([NaN64], 12)
            continue
        end

        # fewer draws for slower estimators
        draws = est == BUB ? 100 : 1000
        draws = est == NSB ? 25 : draws

        println("Running comparison for $est ...")

        mean_entropies = Float64[]
        for exp in 3:14
            println("Sample size: $(2^exp)")
            try
                entropy = compare_contreras_rodriguez(est, exp; draws=draws)
                push!(mean_entropies, entropy)
                # index in list of entropy values = exp - 2
                # because Contreras Rodríguez et al. starts with
                # sample size 2^3 = 8
                ref_entropy = REF_VALUES[est_label][exp-2]
                println("Reference mean entropy: $ref_entropy, DiscreteEntropy.jl: $entropy")
            catch e
                println("ERROR running estimation: $e")
                push!(mean_entropies, NaN64)
            end
        end
        results[est_label] = mean_entropies
    end

    labels = SORTED_ESTIMATOR_LABELS
    n = length(labels)
    # since PrettyTables cannot transpose, we have
    # to do some work ourselves here ... 
    estimates = [round(results[k][exp-2], digits=3) for exp in 3:14 for k in labels]
    deviations = [
        round(results[k][exp-2] - REF_VALUES[k][exp-2], digits=2)
        for exp in 3:14 for k in labels
    ]
    raw_table_results = hcat(labels, reshape(estimates, n, :))
    raw_table_deviations = hcat(labels, reshape(deviations, n, :))
    header = vcat(["Estimator"], [2^n for n in 3:14])

    # print out tables
    pretty_table(raw_table_results; header=header)
    pretty_table(raw_table_deviations; header=header)
end

# From Contreras Rodríguez et al.,
# Entropy 2021, 23, 561. https://doi.org/10.3390/e23050561
# Table 2: "Estimated mean of estimates for byte samples."
REF_VALUES = Dict(
    "CS" => [6.142, 7.635, 8.251, 8.126, 8.064, 8.099, 8.116, 8.003, 7.914, 7.940, 7.963, 7.974],
    "ML" => [2.969, 3.943, 4.880, 5.768, 6.550, 7.178, 7.591, 7.809, 7.908, 7.955, 7.977, 7.989],
    "MM" => [3.590, 4.598, 5.536, 6.397, 7.113, 7.633, 7.902, 7.985, 7.998, 7.999, 8.000, 8.000],
    "SG" => [3.999, 4.470, 5.136, 5.884, 6.598, 7.194, 7.595, 7.809, 7.908, 7.955, 7.977, 7.989],
    "SHR" => [7.940, 7.971, 7.982, 7.991, 7.995, 7.998, 7.999, 7.999, 8.000, 8.000, 8.000, 8.000],
    "JEF" => [7.947, 7.902, 7.834, 7.751, 7.690, 7.696, 7.767, 7.854, 7.919, 7.957, 7.978, 7.989],
    "LAP" => [7.983, 7.969, 7.943, 7.904, 7.859, 7.832, 7.842, 7.883, 7.928, 7.960, 7.979, 7.989],
    "MIN" => [5.048, 5.385, 5.835, 6.357, 6.875, 7.326, 7.643, 7.823, 7.912, 7.956, 7.978, 7.989],
    "BUB" => [7.008, 7.021, 7.476, 7.732, 7.868, 8.061, 7.896, 7.971, 7.983, 7.985, 7.985, 7.985],
    "NSB" => [6.427, 7.065, 7.474, 7.726, 7.855, 7.922, 7.953, 7.969, 7.977, 7.981, 7.983, 7.984],
    "Unseen" => [5.202, 6.751, 7.739, 7.925, 7.945, 7.957, 7.950, 7.948, 7.953, 7.960, 7.966, 7.971],
    "Zhang" => [3.690, 4.696, 5.627, 6.478, 7.178, 7.674, 7.916, 7.979, 7.985, 7.985, 7.985, 7.985],
    "CJ" => [5.349, 7.028, 8.089, 8.093, 8.004, 7.996, 7.987, 7.985, 7.985, 7.985, 7.985, 7.985],
    "UnveilJ" => [3.036, 4.124, 5.316, 6.645, 7.749, 7.95, 7.855, 7.832, 7.894, 7.940, 7.963, 7.974],
    "BN" => [2.963, 4.433, 5.815, 6.981, 7.726, 7.889, 7.41, 6.662, 6.115, 5.829, 5.686, 5.616],
    "GSB" => [4.646, 5.613, 6.464, 7.182, 7.670, 7.921, 7.981, 7.985, 7.985, 7.985, 7.985, 7.985],
    "SHU" => [4.347, 5.329, 6.210, 6.978, 7.541, 7.868, 7.972, 7.985, 7.985, 7.985, 7.985, 7.985],
    "CDM" => [6.165, 6.908, 7.391, 7.689, 7.844, 7.922, 7.961, 7.981, 7.990, 7.995, 7.998, 7.999],
)

# same sorting as in paper (not preserved in Dict `REF_VALUES`)
SORTED_ESTIMATOR_LABELS = ["CS", "ML", "MM", "SG", "SHR", "JEF", "LAP", "MIN", "BUB", "NSB", "Unseen", "Zhang", "CJ", "UnveilJ", "BN", "GSB", "SHU", "CDM"]

ESTIMATORS = Dict(
    "CS" => ChaoShen,
    "ML" => MaximumLikelihood,
    "MM" => MillerMadow,
    "SG" => SchurmannGrassberger,
    "SHR" => Shrink,
    "JEF" => Jeffrey,
    "LAP" => LaPlace,
    "MIN" => Minimax,
    "BUB" => BUB,
    "NSB" => NSB,
    "Unseen" => Unseen,
    "Zhang" => Zhang,
    # Contreras Rodríguez et al. seem to have a typo here, CJ is not mentioned
    # in Table 1, only CWJ, so I am not sure this mapping is correct.
    "CJ" => ChaoWangJost,
    "UnveilJ" => JackknifeMLE,
    "BN" => Bonachela,
    "GSB" => Grassberger,
    "SHU" => Schurmann,
    "CDM" => nothing # Cannot figure out which one this is
)

function read_dev_urandom_bytes(n::UInt)
    bytes = open("/dev/urandom", "r") do io
        read(io, n)
    end
    bytes
end

read_dev_urandom_bytes(n) = read_dev_urandom_bytes(convert(UInt, n))

function compare_contreras_rodriguez(::Type{T}, bytes_exponent::UInt8; draws=1000) where {T<:AbstractEstimator}
    @assert 3 <= bytes_exponent <= 14 "No reference values available for sample size $(2^bytes_exponent)"
    sample_size = 2^bytes_exponent
    mean(
        estimate_h(read_dev_urandom_bytes(sample_size) |> svector |> from_samples, T)
        for _ in 1:draws
    ) |> to_bits
end

compare_contreras_rodriguez(::Type{T}, bytes_exponent; draws=1000) where {T<:AbstractEstimator} =
    compare_contreras_rodriguez(T, convert(UInt8, bytes_exponent), draws=draws)

The script generates the following output (scroll to the end to find the two relevant tables):

[nluetts@fedora DiscreteEntropy-JOSS-Review]$ julia --project --load comparison.jl -e "main()"
###############################################################################

Running comparison for BUB ...
Sample size: 8
Reference mean entropy: 7.008, DiscreteEntropy.jl: 3.614995850373481
Sample size: 16
Reference mean entropy: 7.021, DiscreteEntropy.jl: 4.7899950608286606
Sample size: 32
Reference mean entropy: 7.476, DiscreteEntropy.jl: 6.796319151130138
Sample size: 64
Reference mean entropy: 7.732, DiscreteEntropy.jl: 7.572799602723234
Sample size: 128
Reference mean entropy: 7.868, DiscreteEntropy.jl: 7.9792128580842085
Sample size: 256
Reference mean entropy: 8.061, DiscreteEntropy.jl: 8.058072331938295
Sample size: 512
Reference mean entropy: 7.896, DiscreteEntropy.jl: 8.001327375927534
Sample size: 1024
Reference mean entropy: 7.971, DiscreteEntropy.jl: 7.99473597067706
Sample size: 2048
Reference mean entropy: 7.983, DiscreteEntropy.jl: 8.000663690919412
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9938103137202425
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999977554911958
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000005950701526
###############################################################################

Running comparison for ChaoWangJost ...
Sample size: 8
Reference mean entropy: 5.349, DiscreteEntropy.jl: 5.360278389970844
Sample size: 16
Reference mean entropy: 7.028, DiscreteEntropy.jl: 7.041184521071105
Sample size: 32
Reference mean entropy: 8.089, DiscreteEntropy.jl: 8.118867698781537
Sample size: 64
Reference mean entropy: 8.093, DiscreteEntropy.jl: 8.08072337036545
Sample size: 128
Reference mean entropy: 8.004, DiscreteEntropy.jl: 8.01966232891067
Sample size: 256
Reference mean entropy: 7.996, DiscreteEntropy.jl: 8.008443797783935
Sample size: 512
Reference mean entropy: 7.987, DiscreteEntropy.jl: 8.000142338203254
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000489112501155
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.00003292715623
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999791811121944
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999968198405732
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9999736567289865
###############################################################################

Running comparison for Minimax ...
Sample size: 8
Reference mean entropy: 5.048, DiscreteEntropy.jl: 2.9761034489686033
Sample size: 16
Reference mean entropy: 5.385, DiscreteEntropy.jl: 3.9442625156600024
Sample size: 32
Reference mean entropy: 5.835, DiscreteEntropy.jl: 4.89343609826837
Sample size: 64
Reference mean entropy: 6.357, DiscreteEntropy.jl: 5.783872751201355
Sample size: 128
Reference mean entropy: 6.875, DiscreteEntropy.jl: 6.565523196303405
Sample size: 256
Reference mean entropy: 7.326, DiscreteEntropy.jl: 7.196160029923576
Sample size: 512
Reference mean entropy: 7.643, DiscreteEntropy.jl: 7.606140946314727
Sample size: 1024
Reference mean entropy: 7.823, DiscreteEntropy.jl: 7.819207596467573
Sample size: 2048
Reference mean entropy: 7.912, DiscreteEntropy.jl: 7.912123086208508
Sample size: 4096
Reference mean entropy: 7.956, DiscreteEntropy.jl: 7.956119491898474
Sample size: 8192
Reference mean entropy: 7.978, DiscreteEntropy.jl: 7.978118017008263
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988922826292928
###############################################################################

Running comparison for Grassberger ...
Sample size: 8
Reference mean entropy: 4.646, DiscreteEntropy.jl: 4.475872575973066
Sample size: 16
Reference mean entropy: 5.613, DiscreteEntropy.jl: 5.395986343698666
Sample size: 32
Reference mean entropy: 6.464, DiscreteEntropy.jl: 6.260004822058096
Sample size: 64
Reference mean entropy: 7.182, DiscreteEntropy.jl: 6.980796744618755
Sample size: 128
Reference mean entropy: 7.67, DiscreteEntropy.jl: 7.5182629313850935
Sample size: 256
Reference mean entropy: 7.921, DiscreteEntropy.jl: 7.827305281488433
Sample size: 512
Reference mean entropy: 7.981, DiscreteEntropy.jl: 7.961604461074284
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.995547205459536
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000232539142349
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000359491911881
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999967625418502
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.9998609625723525
###############################################################################

Running comparison for Jeffrey ...
Sample size: 8
Reference mean entropy: 7.947, DiscreteEntropy.jl: 2.972254402351613
Sample size: 16
Reference mean entropy: 7.902, DiscreteEntropy.jl: 3.9547463784273966
Sample size: 32
Reference mean entropy: 7.834, DiscreteEntropy.jl: 4.8988101907857375
Sample size: 64
Reference mean entropy: 7.751, DiscreteEntropy.jl: 5.796691217187817
Sample size: 128
Reference mean entropy: 7.69, DiscreteEntropy.jl: 6.595211026669902
Sample size: 256
Reference mean entropy: 7.696, DiscreteEntropy.jl: 7.2428089132809506
Sample size: 512
Reference mean entropy: 7.767, DiscreteEntropy.jl: 7.6545957368144855
Sample size: 1024
Reference mean entropy: 7.854, DiscreteEntropy.jl: 7.844389558269971
Sample size: 2048
Reference mean entropy: 7.919, DiscreteEntropy.jl: 7.918733289952589
Sample size: 4096
Reference mean entropy: 7.957, DiscreteEntropy.jl: 7.957507646842208
Sample size: 8192
Reference mean entropy: 7.978, DiscreteEntropy.jl: 7.978231591590087
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988930604764357
###############################################################################

Running comparison for LaPlace ...
Sample size: 8
Reference mean entropy: 7.983, DiscreteEntropy.jl: 2.9736909829661333
Sample size: 16
Reference mean entropy: 7.969, DiscreteEntropy.jl: 3.9533018294946505
Sample size: 32
Reference mean entropy: 7.943, DiscreteEntropy.jl: 4.903365419183482
Sample size: 64
Reference mean entropy: 7.904, DiscreteEntropy.jl: 5.808311331087142
Sample size: 128
Reference mean entropy: 7.859, DiscreteEntropy.jl: 6.621930848735704
Sample size: 256
Reference mean entropy: 7.832, DiscreteEntropy.jl: 7.27588208048709
Sample size: 512
Reference mean entropy: 7.842, DiscreteEntropy.jl: 7.692674035956105
Sample size: 1024
Reference mean entropy: 7.883, DiscreteEntropy.jl: 7.869540223819323
Sample size: 2048
Reference mean entropy: 7.928, DiscreteEntropy.jl: 7.927973374245435
Sample size: 4096
Reference mean entropy: 7.96, DiscreteEntropy.jl: 7.959961271968038
Sample size: 8192
Reference mean entropy: 7.979, DiscreteEntropy.jl: 7.9786682763551084
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.989102788616696
###############################################################################

Running comparison for Shrink ...
Sample size: 8
Reference mean entropy: 7.94, DiscreteEntropy.jl: 2.9808091479101586
Sample size: 16
Reference mean entropy: 7.971, DiscreteEntropy.jl: 3.953894179273727
Sample size: 32
Reference mean entropy: 7.982, DiscreteEntropy.jl: 4.911835316261442
Sample size: 64
Reference mean entropy: 7.991, DiscreteEntropy.jl: 5.821070473660774
Sample size: 128
Reference mean entropy: 7.995, DiscreteEntropy.jl: 6.655003767983381
Sample size: 256
Reference mean entropy: 7.998, DiscreteEntropy.jl: 7.329850657777553
Sample size: 512
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.758420052803192
Sample size: 1024
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.929509738707598
Sample size: 2048
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.966671297879019
Sample size: 4096
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.983602410097988
Sample size: 8192
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.992357342562944
Sample size: 16384
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.9966348604391735
###############################################################################

Running comparison for SchurmannGrassberger ...
Sample size: 8
Reference mean entropy: 3.999, DiscreteEntropy.jl: 2.971736967454633
Sample size: 16
Reference mean entropy: 4.47, DiscreteEntropy.jl: 3.944955929980586
Sample size: 32
Reference mean entropy: 5.136, DiscreteEntropy.jl: 4.8825236480803875
Sample size: 64
Reference mean entropy: 5.884, DiscreteEntropy.jl: 5.7679317080531565
Sample size: 128
Reference mean entropy: 6.598, DiscreteEntropy.jl: 6.549177687335087
Sample size: 256
Reference mean entropy: 7.194, DiscreteEntropy.jl: 7.176950412445397
Sample size: 512
Reference mean entropy: 7.595, DiscreteEntropy.jl: 7.591266466353353
Sample size: 1024
Reference mean entropy: 7.809, DiscreteEntropy.jl: 7.808764456449525
Sample size: 2048
Reference mean entropy: 7.908, DiscreteEntropy.jl: 7.908099032251887
Sample size: 4096
Reference mean entropy: 7.955, DiscreteEntropy.jl: 7.954910030058388
Sample size: 8192
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.977386074854952
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988781719453369
###############################################################################

Running comparison for Zhang ...
Sample size: 8
Reference mean entropy: 3.69, DiscreteEntropy.jl: 3.70057718590882
Sample size: 16
Reference mean entropy: 4.696, DiscreteEntropy.jl: 4.708114791136625
Sample size: 32
Reference mean entropy: 5.627, DiscreteEntropy.jl: 5.640825481842562
Sample size: 64
Reference mean entropy: 6.478, DiscreteEntropy.jl: 6.489832547884759
Sample size: 128
Reference mean entropy: 7.178, DiscreteEntropy.jl: 7.185693307022013
Sample size: 256
Reference mean entropy: 7.674, DiscreteEntropy.jl: 7.685618818498393
Sample size: 512
Reference mean entropy: 7.916, DiscreteEntropy.jl: 7.931552076659486
Sample size: 1024
Reference mean entropy: 7.979, DiscreteEntropy.jl: 7.995478619546022
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.00011685326433
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.0002027748647
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999933474194657
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.999973102305582
###############################################################################

Running comparison for Bonachela ...
Sample size: 8
Reference mean entropy: 2.963, DiscreteEntropy.jl: 3.258906518604906
Sample size: 16
Reference mean entropy: 4.433, DiscreteEntropy.jl: 5.008529414146986
Sample size: 32
Reference mean entropy: 5.815, DiscreteEntropy.jl: 6.828142608055729
Sample size: 64
Reference mean entropy: 6.981, DiscreteEntropy.jl: 8.479980699266957
Sample size: 128
Reference mean entropy: 7.726, DiscreteEntropy.jl: 9.7162166211824
Sample size: 256
Reference mean entropy: 7.889, DiscreteEntropy.jl: 10.242244993398158
Sample size: 512
Reference mean entropy: 7.41, DiscreteEntropy.jl: 9.956818207508285
Sample size: 1024
Reference mean entropy: 6.662, DiscreteEntropy.jl: 9.221113291455794
Sample size: 2048
Reference mean entropy: 6.115, DiscreteEntropy.jl: 8.632933243927424
Sample size: 4096
Reference mean entropy: 5.829, DiscreteEntropy.jl: 8.316754014736388
Sample size: 8192
Reference mean entropy: 5.686, DiscreteEntropy.jl: 8.158346741992037
Sample size: 16384
Reference mean entropy: 5.616, DiscreteEntropy.jl: 8.079233029603056
###############################################################################

Running comparison for MillerMadow ...
Sample size: 8
Reference mean entropy: 3.59, DiscreteEntropy.jl: 3.592985854164966
Sample size: 16
Reference mean entropy: 4.598, DiscreteEntropy.jl: 4.5973992117016325
Sample size: 32
Reference mean entropy: 5.536, DiscreteEntropy.jl: 5.5378266814833585
Sample size: 64
Reference mean entropy: 6.397, DiscreteEntropy.jl: 6.3944223125125115
Sample size: 128
Reference mean entropy: 7.113, DiscreteEntropy.jl: 7.116156496268463
Sample size: 256
Reference mean entropy: 7.633, DiscreteEntropy.jl: 7.6320174448968015
Sample size: 512
Reference mean entropy: 7.902, DiscreteEntropy.jl: 7.903183965765673
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.984768837092095
Sample size: 2048
Reference mean entropy: 7.998, DiscreteEntropy.jl: 7.99727572744489
Sample size: 4096
Reference mean entropy: 7.999, DiscreteEntropy.jl: 7.999653223504991
Sample size: 8192
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.9998848638654705
Sample size: 16384
Reference mean entropy: 8.0, DiscreteEntropy.jl: 7.999995299779101
###############################################################################

Running comparison for MaximumLikelihood ...
Sample size: 8
Reference mean entropy: 2.969, DiscreteEntropy.jl: 2.97425000000003
Sample size: 16
Reference mean entropy: 3.943, DiscreteEntropy.jl: 3.9426390976555274
Sample size: 32
Reference mean entropy: 4.88, DiscreteEntropy.jl: 4.877901691010165
Sample size: 64
Reference mean entropy: 5.768, DiscreteEntropy.jl: 5.767319630991382
Sample size: 128
Reference mean entropy: 6.55, DiscreteEntropy.jl: 6.5509600442318
Sample size: 256
Reference mean entropy: 7.178, DiscreteEntropy.jl: 7.1746188465355125
Sample size: 512
Reference mean entropy: 7.591, DiscreteEntropy.jl: 7.591458230569615
Sample size: 1024
Reference mean entropy: 7.809, DiscreteEntropy.jl: 7.808054042355388
Sample size: 2048
Reference mean entropy: 7.908, DiscreteEntropy.jl: 7.907547838590217
Sample size: 4096
Reference mean entropy: 7.955, DiscreteEntropy.jl: 7.954778160636422
Sample size: 8192
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.977414005402187
Sample size: 16384
Reference mean entropy: 7.989, DiscreteEntropy.jl: 7.988727784134597
###############################################################################

Running comparison for Schurmann ...
Sample size: 8
Reference mean entropy: 4.347, DiscreteEntropy.jl: 4.406152515153422
Sample size: 16
Reference mean entropy: 5.329, DiscreteEntropy.jl: 5.371441079483403
Sample size: 32
Reference mean entropy: 6.21, DiscreteEntropy.jl: 6.258362134412951
Sample size: 64
Reference mean entropy: 6.978, DiscreteEntropy.jl: 7.009485345023251
Sample size: 128
Reference mean entropy: 7.541, DiscreteEntropy.jl: 7.571587134463173
Sample size: 256
Reference mean entropy: 7.868, DiscreteEntropy.jl: 7.890860309539617
Sample size: 512
Reference mean entropy: 7.972, DiscreteEntropy.jl: 7.988336933785518
Sample size: 1024
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.99878737682248
Sample size: 2048
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000104914446608
Sample size: 4096
Reference mean entropy: 7.985, DiscreteEntropy.jl: 7.99987694067543
Sample size: 8192
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000120087586403
Sample size: 16384
Reference mean entropy: 7.985, DiscreteEntropy.jl: 8.000031759411705
###############################################################################

Running comparison for ChaoShen ...
Sample size: 8
Reference mean entropy: 6.142, DiscreteEntropy.jl: 6.175011056668875
Sample size: 16
Reference mean entropy: 7.635, DiscreteEntropy.jl: 7.625722852594498
Sample size: 32
Reference mean entropy: 8.251, DiscreteEntropy.jl: 8.324936762316629
Sample size: 64
Reference mean entropy: 8.126, DiscreteEntropy.jl: 8.130934934222017
Sample size: 128
Reference mean entropy: 8.064, DiscreteEntropy.jl: 8.0807378701635
Sample size: 256
Reference mean entropy: 8.099, DiscreteEntropy.jl: 8.114032732196659
Sample size: 512
Reference mean entropy: 8.116, DiscreteEntropy.jl: 8.130757598494167
Sample size: 1024
Reference mean entropy: 8.003, DiscreteEntropy.jl: 8.017750504080228
Sample size: 2048
Reference mean entropy: 7.914, DiscreteEntropy.jl: 7.929019945213538
Sample size: 4096
Reference mean entropy: 7.94, DiscreteEntropy.jl: 7.9546912060882935
Sample size: 8192
Reference mean entropy: 7.963, DiscreteEntropy.jl: 7.977376272281654
Sample size: 16384
Reference mean entropy: 7.974, DiscreteEntropy.jl: 7.988768967507508
###############################################################################

Running comparison for NSB ...
Sample size: 8
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 16
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 32
ERROR running estimation: Roots.ConvergenceFailed("Algorithm failed to converge")
Sample size: 64
Reference mean entropy: 7.726, DiscreteEntropy.jl: 5.819025242186299
Sample size: 128
Reference mean entropy: 7.855, DiscreteEntropy.jl: 6.650152910029301
Sample size: 256
Reference mean entropy: 7.922, DiscreteEntropy.jl: 7.327239500418171
Sample size: 512
Reference mean entropy: 7.953, DiscreteEntropy.jl: 7.75411110054627
Sample size: 1024
Reference mean entropy: 7.969, DiscreteEntropy.jl: 7.940917656786869
Sample size: 2048
Reference mean entropy: 7.977, DiscreteEntropy.jl: 7.967618652904784
Sample size: 4096
Reference mean entropy: 7.981, DiscreteEntropy.jl: 7.975301195389316
Sample size: 8192
Reference mean entropy: 7.983, DiscreteEntropy.jl: 7.983648977913771
Sample size: 16384
Reference mean entropy: 7.984, DiscreteEntropy.jl: 7.990635829134803
###############################################################################

Running comparison for JackknifeMLE ...
Sample size: 8
Reference mean entropy: 3.036, DiscreteEntropy.jl: 2.7853946642449614
Sample size: 16
Reference mean entropy: 4.124, DiscreteEntropy.jl: 3.8558508962325395
Sample size: 32
Reference mean entropy: 5.316, DiscreteEntropy.jl: 4.835834971119449
Sample size: 64
Reference mean entropy: 6.645, DiscreteEntropy.jl: 5.745386104854185
Sample size: 128
Reference mean entropy: 7.749, DiscreteEntropy.jl: 6.540863398829251
Sample size: 256
Reference mean entropy: 7.95, DiscreteEntropy.jl: 7.1722380222088455
Sample size: 512
Reference mean entropy: 7.855, DiscreteEntropy.jl: 7.589458477649767
Sample size: 1024
Reference mean entropy: 7.832, DiscreteEntropy.jl: 7.809161934211856
Sample size: 2048
Reference mean entropy: 7.894, DiscreteEntropy.jl: 7.907587677473503
Sample size: 4096
Reference mean entropy: 7.94, DiscreteEntropy.jl: 7.95460261651072
Sample size: 8192
Reference mean entropy: 7.963, DiscreteEntropy.jl: 7.977504117231058
Sample size: 16384
Reference mean entropy: 7.974, DiscreteEntropy.jl: 7.9887301921698
###############################################################################

Running comparison for Unseen ...
Sample size: 8
Reference mean entropy: 5.202, DiscreteEntropy.jl: -1.2256381298995431
Sample size: 16
Reference mean entropy: 6.751, DiscreteEntropy.jl: -0.8932659978524178
Sample size: 32
Reference mean entropy: 7.739, DiscreteEntropy.jl: -0.8418141110014484
Sample size: 64
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
Sample size: 128
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
Sample size: 256
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 512
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 6")
Sample size: 1024
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 9")
Sample size: 2048
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 8")
Sample size: 4096
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 8192
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 4")
Sample size: 16384
ERROR running estimation: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")
###############################################################################

┌ Warning: No estimator for "CDM", skipping ...
└ @ Main /var/home/nluetts/Repos/DiscreteEntropy-JOSS-Review/comparison.jl:12
┌───────────┬────────┬────────┬────────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │      8 │     16 │     32 │    64 │   128 │    256 │   512 │  1024 │  2048 │  4096 │  8192 │ 16384 │
├───────────┼────────┼────────┼────────┼───────┼───────┼────────┼───────┼───────┼───────┼───────┼───────┼───────┤
│        CS │  6.175 │  7.626 │  8.325 │ 8.131 │ 8.081 │  8.114 │ 8.131 │ 8.018 │ 7.929 │ 7.955 │ 7.977 │ 7.989 │
│        ML │  2.974 │  3.943 │  4.878 │ 5.767 │ 6.551 │  7.175 │ 7.591 │ 7.808 │ 7.908 │ 7.955 │ 7.977 │ 7.989 │
│        MM │  3.593 │  4.597 │  5.538 │ 6.394 │ 7.116 │  7.632 │ 7.903 │ 7.985 │ 7.997 │   8.0 │   8.0 │   8.0 │
│        SG │  2.972 │  3.945 │  4.883 │ 5.768 │ 6.549 │  7.177 │ 7.591 │ 7.809 │ 7.908 │ 7.955 │ 7.977 │ 7.989 │
│       SHR │  2.981 │  3.954 │  4.912 │ 5.821 │ 6.655 │   7.33 │ 7.758 │  7.93 │ 7.967 │ 7.984 │ 7.992 │ 7.997 │
│       JEF │  2.972 │  3.955 │  4.899 │ 5.797 │ 6.595 │  7.243 │ 7.655 │ 7.844 │ 7.919 │ 7.958 │ 7.978 │ 7.989 │
│       LAP │  2.974 │  3.953 │  4.903 │ 5.808 │ 6.622 │  7.276 │ 7.693 │  7.87 │ 7.928 │  7.96 │ 7.979 │ 7.989 │
│       MIN │  2.976 │  3.944 │  4.893 │ 5.784 │ 6.566 │  7.196 │ 7.606 │ 7.819 │ 7.912 │ 7.956 │ 7.978 │ 7.989 │
│       BUB │  3.615 │   4.79 │  6.796 │ 7.573 │ 7.979 │  8.058 │ 8.001 │ 7.995 │ 8.001 │ 7.994 │   8.0 │   8.0 │
│       NSB │    NaN │    NaN │    NaN │ 5.819 │  6.65 │  7.327 │ 7.754 │ 7.941 │ 7.968 │ 7.975 │ 7.984 │ 7.991 │
│    Unseen │ -1.226 │ -0.893 │ -0.842 │   NaN │   NaN │    NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
│     Zhang │  3.701 │  4.708 │  5.641 │  6.49 │ 7.186 │  7.686 │ 7.932 │ 7.995 │   8.0 │   8.0 │   8.0 │   8.0 │
│        CJ │   5.36 │  7.041 │  8.119 │ 8.081 │  8.02 │  8.008 │   8.0 │   8.0 │   8.0 │   8.0 │   8.0 │   8.0 │
│   UnveilJ │  2.785 │  3.856 │  4.836 │ 5.745 │ 6.541 │  7.172 │ 7.589 │ 7.809 │ 7.908 │ 7.955 │ 7.978 │ 7.989 │
│        BN │  3.259 │  5.009 │  6.828 │  8.48 │ 9.716 │ 10.242 │ 9.957 │ 9.221 │ 8.633 │ 8.317 │ 8.158 │ 8.079 │
│       GSB │  4.476 │  5.396 │   6.26 │ 6.981 │ 7.518 │  7.827 │ 7.962 │ 7.996 │   8.0 │   8.0 │   8.0 │   8.0 │
│       SHU │  4.406 │  5.371 │  6.258 │ 7.009 │ 7.572 │  7.891 │ 7.988 │ 7.999 │   8.0 │   8.0 │   8.0 │   8.0 │
│       CDM │    NaN │    NaN │    NaN │   NaN │   NaN │    NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
└───────────┴────────┴────────┴────────┴───────┴───────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘
┌───────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Estimator │     8 │    16 │    32 │    64 │   128 │   256 │   512 │  1024 │  2048 │  4096 │  8192 │ 16384 │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│        CS │  0.03 │ -0.01 │  0.07 │   0.0 │  0.02 │  0.02 │  0.01 │  0.01 │  0.02 │  0.01 │  0.01 │  0.01 │
│        ML │  0.01 │  -0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │
│        MM │   0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │   0.0 │  -0.0 │  -0.0 │
│        SG │ -1.03 │ -0.53 │ -0.25 │ -0.12 │ -0.05 │ -0.02 │  -0.0 │  -0.0 │   0.0 │  -0.0 │   0.0 │  -0.0 │
│       SHR │ -4.96 │ -4.02 │ -3.07 │ -2.17 │ -1.34 │ -0.67 │ -0.24 │ -0.07 │ -0.03 │ -0.02 │ -0.01 │  -0.0 │
│       JEF │ -4.97 │ -3.95 │ -2.94 │ -1.95 │ -1.09 │ -0.45 │ -0.11 │ -0.01 │  -0.0 │   0.0 │   0.0 │  -0.0 │
│       LAP │ -5.01 │ -4.02 │ -3.04 │  -2.1 │ -1.24 │ -0.56 │ -0.15 │ -0.01 │  -0.0 │  -0.0 │  -0.0 │   0.0 │
│       MIN │ -2.07 │ -1.44 │ -0.94 │ -0.57 │ -0.31 │ -0.13 │ -0.04 │  -0.0 │   0.0 │   0.0 │   0.0 │  -0.0 │
│       BUB │ -3.39 │ -2.23 │ -0.68 │ -0.16 │  0.11 │  -0.0 │  0.11 │  0.02 │  0.02 │  0.01 │  0.01 │  0.02 │
│       NSB │   NaN │   NaN │   NaN │ -1.91 │  -1.2 │ -0.59 │  -0.2 │ -0.03 │ -0.01 │ -0.01 │   0.0 │  0.01 │
│    Unseen │ -6.43 │ -7.64 │ -8.58 │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
│     Zhang │  0.01 │  0.01 │  0.01 │  0.01 │  0.01 │  0.01 │  0.02 │  0.02 │  0.02 │  0.02 │  0.01 │  0.01 │
│        CJ │  0.01 │  0.01 │  0.03 │ -0.01 │  0.02 │  0.01 │  0.01 │  0.02 │  0.02 │  0.01 │  0.01 │  0.01 │
│   UnveilJ │ -0.25 │ -0.27 │ -0.48 │  -0.9 │ -1.21 │ -0.78 │ -0.27 │ -0.02 │  0.01 │  0.01 │  0.01 │  0.01 │
│        BN │   0.3 │  0.58 │  1.01 │   1.5 │  1.99 │  2.35 │  2.55 │  2.56 │  2.52 │  2.49 │  2.47 │  2.46 │
│       GSB │ -0.17 │ -0.22 │  -0.2 │  -0.2 │ -0.15 │ -0.09 │ -0.02 │  0.01 │  0.02 │  0.02 │  0.01 │  0.01 │
│       SHU │  0.06 │  0.04 │  0.05 │  0.03 │  0.03 │  0.02 │  0.02 │  0.01 │  0.02 │  0.01 │  0.02 │  0.02 │
│       CDM │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │   NaN │
└───────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

The first table shows the mean estimated entropies from DiscreteEntropy.jl; the
second shows the deviations from Table 2 of the paper (each mean is calculated
from 1000 estimates, except for BUB and NSB, where I reduced this number
because they were much slower than the other estimators). The rows are the
different estimators and the columns the different sample sizes.

The NaN entries mark runs that did not succeed. For CDM, I simply could not
figure out which estimator it corresponds to in your package, so these values
are missing; for the rest, you can find the error messages in the output:

NSB threw a Roots.ConvergenceFailed error for sample sizes < 64, and Unseen
threw a DimensionMismatch error for sample sizes > 32.
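
For completeness, here is how the two failures can be triggered in isolation,
reusing the helpers from the comparison script above (a sketch; the errors
occurred in my runs but may depend on the random draw):

using DiscreteEntropy

small = read_dev_urandom_bytes(8)                 # sample size < 64
estimate_h(from_samples(svector(small)), NSB)     # threw Roots.ConvergenceFailed in my runs

large = read_dev_urandom_bytes(64)                # sample size > 32
estimate_h(from_samples(svector(large)), Unseen)  # threw DimensionMismatch in my runs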

Otherwise you can see that the comparison succeeds for ChaoShen (CS),
MaximumLikelihood (ML), MillerMadow (MM), Zhang, ChaoWangJost (CJ) and Schürmann
(SHU). For the rest, however, there are noticeable deviations.

I saw in your test cases that you compare against the R package entropy, which
works just fine, so I also tried using R to compare against the paper (in a
reduced fashion, only for a sample size of 8 and only for the estimators that
are available there; I am not an R programmer ...):

library(entropy)

read_random_bytes <- function(n) {
  conn <- file("/dev/urandom", open = "rb")
  random_bytes <- readBin(conn, what = "raw", n = n)
  close(conn)
  return(random_bytes)
}

run <- function(draws) {
    for (m in c("ML", "MM", "Jeffreys", "Laplace", "SG", "minimax", "CS", "shrink")) {
        # `table()` does a frequency count
        vals <- sapply(1:draws, function(x) entropy(table(as.integer(read_random_bytes(8))), method = m, unit="log2"))
        print(paste(m, mean(vals)))
    }
}

suppressWarnings(run(1000))

Which yields:

[1] "ML 2.9735"
[1] "MM 3.59808838076576"
[1] "Jeffreys 2.97549971378874"
[1] "Laplace 2.97594394836141"
[1] "SG 2.97039231013123"
[1] "minimax 2.97438154499471"
[1] "CS 6.16299121698367"
[1] "shrink 2.97863055736597"

These are (almost) the same values that DiscreteEntropy.jl produces! But they
still disagree with the paper, of course.

It might well be that I made a mistake in the comparison. Can you check this
and comment on the source of the deviations?

kellino (Owner) commented Oct 11, 2024

Thanks @nluetts. I'll do my best to answer these questions, but some will need more investigation. First, I have not implemented "CDM". It doesn't seem much used in the literature and is largely (as far as I can tell) superseded by PYM. I'll probably add it to the library at a later date if there is any demand for it.

I can't comment on the quality of the work in Contreras Rodríguez et al., and I have not tried to match against their results. I really only used it as a (partial) basis for what to implement. As far as possible, I went back to the authors' original implementations (where they existed) and compared output against them.

NSB is an interesting case, as the authors' implementation gives different results from the Python implementation used by Contreras Rodríguez et al., which in turn gives different results from DiscreteEntropy.jl. I cannot claim with certainty that my version is correct, but it does most faithfully copy the maths as presented in the paper. NSB (and some of the other Bayesian estimators) uses integration: this could be a major source of divergence. To avoid complex rewriting to prevent overflow (as found in other NSB implementations), I used Julia's arbitrary-precision floats. Integrating over these could easily introduce some of the deviation that you see. The question then becomes whether the lack of precision was considered by the authors when presenting their maths. My guess is that it was not, and therefore the deviations are due to floating-point imperfections. I noticed while writing the code that these add up very quickly.
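
As a toy illustration of that accumulation effect (this is deliberately not the
NSB code, just repeated summation of a value that is not exactly representable
in binary):

# Rounding drift: accumulate 1e-10 ten million times in Float64 vs BigFloat.
function drift(T)
    acc = zero(T)
    for _ in 1:10^7
        acc += T(1e-10)
    end
    acc
end

drift(Float64) - 1e-3            # small but nonzero rounding drift
Float64(drift(BigFloat) - 1e-3)  # orders of magnitude smaller

Numerically, an integral is a sum of many such terms, so small per-term errors
can accumulate into visible deviations.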

Also, at small sample sizes, entropy estimators vary tremendously from run to run; 1000 iterations may not be enough for stability at very small sizes. I note that as the sample size increases, the tables converge. BN is interesting, though. It is specifically designed for small datasets (I will add this information to the docs): divergence increases as more samples are added, which is the behaviour I think I would expect to see, and at small sizes we again have the stability problem. Looking at the code in frequentist.jl, I can't immediately see an error compared against the formula in the docs.

I will add a check for sample sizes < 64 to NSB, though that sounds more like a bug; a sketch of what I have in mind follows below. I'm not sure what's happening with Unseen, please give me a few days to figure it out.
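
Something like the following (hypothetical helper name and cut-off, not the
eventual implementation; the 64 comes from the failures reported above):

# Hypothetical guard: warn before handing a too-small sample to NSB.
function nsb_checked(sample; min_size=64)
    if length(sample) < min_size
        @warn "NSB may fail or be unreliable below $min_size samples"
    end
    estimate_h(from_samples(svector(sample)), NSB)
end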

kellino (Owner) commented Oct 11, 2024

I've found the error in NSB: whenever the input has no coincidences (i.e., every symbol is seen only once), it fails. I need to figure out why that is happening and how to fix it. Thanks for the spot!
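
For anyone following along, a minimal input of that shape (assuming the same
API as in the comparison script above):

using DiscreteEntropy

# Eight distinct symbols, each seen exactly once: no coincidences.
data = from_samples(svector(collect(1:8)))
estimate_h(data, NSB)  # throws Roots.ConvergenceFailed on v0.2.0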
