Dear khmer team,
I'm amazed at how much effort has gone into such a widely used project/software. As mentioned in the first version of the paper (and in many other places), Count-Min has an estimation variance in the form of one-sided overestimation, and reducing that variance by using a larger number of hash functions increases both computational cost and space/memory (O(m*p + p), where m is the number of counters per row and p is the number of hash functions).

In my real-world problem I need the frequency of each k-mer in a metagenomic sample, and the more accurate the better. The maximum likelihood estimator (MLE) and the Bayesian estimator (BE, much slower) described in the KDD paper by Daniel Ting (2018) (https://dl.acm.org/doi/abs/10.1145/3219819.3219975) appear to be more accurate than the default Count-Min estimator in many experiments. I imagine the BE could be very expensive, but it seems to be the theoretical optimum for this kind of non-parametric estimation. The MLE seems more practical, although the error distribution of each row in the hash table still needs to be computed, which adds computational cost. The good news is that we no longer need to rely on increasing the number of hash functions (previously the only option) to improve accuracy.

I'd be very interested to hear how the team thinks about the theoretical improvements from Ting 2018. I'm new to probabilistic data structures but amazed at how they can be used to do efficient computation in metagenomics.
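For context, here is a minimal pure-Python sketch of a Count-Min table with the classic min estimator. This is only my own illustration (the class, parameter names, and hashing scheme are assumptions, not khmer's API); it shows where the one-sided overestimation comes from and the point in `query()` where an MLE/Bayesian estimator in the spirit of Ting 2018 would plug in.

```python
# Minimal Count-Min sketch -- illustrative only, not khmer's implementation.
import random


class CountMinSketch:
    def __init__(self, width, depth, seed=42):
        self.width = width          # m: counters per row
        self.depth = depth          # p: number of hash functions / rows
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        # One independent hash seed per row (simple salted-hash scheme).
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, row, item):
        # Salted hash of the item for the given row.
        return hash((self.seeds[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Classic estimator: take the minimum over rows.
        # Each row can only overestimate (collisions add counts), so the
        # error is one-sided; the MLE/BE of Ting (2018) would replace this
        # min with an estimate that models the collision noise per row.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(width=1000, depth=4)
    for kmer in ["ACGT", "ACGT", "ACGT", "TTGA"]:
        cms.add(kmer)
    print(cms.query("ACGT"))   # >= 3, never less (one-sided error)
    print(cms.query("TTGA"))   # >= 1
```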
Thanks,
Jianshu