Testing against Contreras Rodríguez et al. 2021 #11
Thanks @nluetts. I'll do my best to answer these questions, but some will need more investigation.

First, I have not implemented "CDM". It doesn't seem much used in the literature and is largely (as far as I can tell) superseded by PYM. I'll probably add it to the library at a later date if there is any demand for it.

I can't comment on the quality of the work in Rodríguez et al. and I have not tried to match against their results; I really only used it as a (partial) basis for what to implement. As far as possible, I went back to the authors' original implementations (where they existed) and compared output against them. NSB is an interesting case, as the author's implementation gives different results from the Python implementation used by Rodríguez et al., which in turn gives different results again.

Also, at small sample sizes, entropy estimators vary tremendously from run to run; 1000 iterations may not be enough for stability at very small sizes. I note that as sample size increases, the tables converge.

BN is interesting, though. It is specifically for small datasets (I will add this information to the docs): divergence increases as more samples are added, which is the behaviour I would expect to see, and at small sizes we again have the stability problem.

Looking at the code, I will add a check for sample size.
I've found the error in NSB: whenever the input has no coincidences (i.e. everything is seen only once), it fails. I need to figure out why that is happening and how to fix it. Thanks for the spot!
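For context on the no-coincidence failure mode: NSB-style estimators are driven by the number of coincidences in the sample (total observations minus distinct symbols), and when every symbol occurs exactly once that number is zero, which can starve the downstream root-finding step. A minimal, hypothetical sketch of such a guard (in Python, not the library's actual code):

```python
from collections import Counter


def coincidences(samples):
    """Number of coincidences: total samples minus distinct symbols seen."""
    counts = Counter(samples)
    return len(samples) - len(counts)


# Every symbol unique -> zero coincidences; an NSB-style estimator should
# reject or special-case this input rather than fail inside a root finder.
assert coincidences([1, 2, 3, 4]) == 0
assert coincidences([1, 1, 2, 3]) == 1
```

A check like this at the entry point would turn the `Roots.ConvergenceFailed` error into an informative message.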
Hi @kellino 👋
this is Nils. I am opening this issue in response to the review over at JOSS (openjournals/joss-reviews#7334).
From your JOSS paper draft I got the impression that the paper by Contreras Rodríguez et al. (https://doi.org/10.3390/e23050561) is quite central to your package, so I thought testing your package against the estimated entropies published there would be a good thing to do (this does not yet seem to be covered by your test cases, although they are otherwise quite comprehensive).
In the paper, they estimate entropies for byte sequences generated on a Linux machine with `/dev/urandom`, so I did the same and tried to reproduce their Table 1 with `DiscreteEntropy.jl`. I used this Julia environment:
And the following script to do the comparison:
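(The actual script is not reproduced here. As a rough, hypothetical illustration of the approach — in Python rather than the Julia script used — the core sampling-and-estimation step looks roughly like this:)

```python
import math
import os
from collections import Counter


def plugin_entropy_bits(data: bytes) -> float:
    """Maximum-likelihood (plug-in) entropy of a byte sequence, in bits."""
    n = len(data)
    counts = Counter(data).values()
    return -sum(c / n * math.log2(c / n) for c in counts)


# Draw a small sample from the OS entropy source, as in the paper's setup.
sample = os.urandom(64)
h = plugin_entropy_bits(sample)

# For uniform bytes the true entropy is 8 bits, but the plug-in estimate
# is biased low at small sample sizes: with 64 draws it can be at most
# log2(64) = 6 bits. This bias is what the corrected estimators target.
assert 0.0 < h <= 6.0
```

The real comparison repeats this for each sample size and estimator and averages over many runs.)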
The script generates the following output (scroll to the end to find the two relevant tables):
The first table contains the mean estimated entropies from `DiscreteEntropy.jl`; the second table contains the deviations from Table 1 in the paper (the mean is usually calculated from 1000 estimates; only for BUB and NSB did I reduce this number because they were much slower than the other estimators). The rows are the different estimators and the columns the different sample sizes.
The entries that say `NaN` failed to run. For CDM, I simply could not figure out what it corresponds to in your package, thus these values are missing; for the rest you can find the error messages in the output: `NSB` threw a `Roots.ConvergenceFailed` error for small sample sizes (< 64) and `Unseen` threw a `DimensionMismatch` error for sample sizes > 32.
Otherwise, you can see that the comparison succeeds for ChaoShen (CS), MaximumLikelihood (ML), MillerMadow (MM), Zhang, ChaoWangJost (CJ) and Schürmann (SHU). For the rest, however, there are noticeable deviations.
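(For reference, the estimators that do agree are simple closed-form corrections of the plug-in estimate; e.g. Miller-Madow adds (K−1)/(2N) nats, where K is the number of observed symbols and N the sample size. A quick standalone Python check of that relationship, independent of `DiscreteEntropy.jl`:)

```python
import math
from collections import Counter


def ml_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    n = len(samples)
    counts = Counter(samples).values()
    return -sum(c / n * math.log(c / n) for c in counts)


def miller_madow(samples):
    """ML estimate plus the Miller-Madow bias correction (K - 1) / (2N)."""
    k = len(set(samples))
    return ml_entropy(samples) + (k - 1) / (2 * len(samples))


# The correction is always non-negative, so MM >= ML for any sample.
data = [0, 0, 1, 1, 2, 3]
assert miller_madow(data) > ml_entropy(data)
```

Estimators with such closed forms leave little room for implementation differences, which may be why they match the paper while the more elaborate ones deviate.)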
I saw in your test cases that you compare against the R package `entropy`, which works just fine, so I also tried using R to compare against the paper (in a reduced fashion, only for a sample size of 8 and the estimators which are available; I am not an R programmer ...):
Which yields:
These are (almost) the same values that `DiscreteEntropy.jl` produces! But they still disagree with the paper, of course.
It might well be that I made a mistake in the comparison. Can you check this and comment on the source of the deviations?