Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New count-min estimator #1921

Open
jianshu93 opened this issue Oct 19, 2023 · 0 comments
Open

New count-min estimator #1921

jianshu93 opened this issue Oct 19, 2023 · 0 comments

Comments

@jianshu93
Copy link

Dear khmer team,

Amazed at how many efforts have been devoted to such a widely used project/software. As mentioned in the first version of the paper and also many places, count min has an estimation variance (overestimation) and improving the variance by using larger number of hash functions increase computational cost and also space/memory (O(m*p + p), where p is the number of hash functions). In a real word problem, I need the frequency of each kmer in a metagenomic sample, the more accurate the better. The maximum likelihood estimator (MLE) and a Bayesian estimator (BE, much slower) mentioned in the KDD paper by Daniel Ting (2018) (https://dl.acm.org/doi/abs/10.1145/3219819.3219975) seem to be more accurate than the default count min in many experiments. I imagine the BE could be very expensive but it seems to be the theoretical optimal in the case of non-parametric parameter estimation. MLE seems to be more practical but still the error distribution of each row in the hash table needs to be computed, which add additional computational cost. Good news is we do not need to increase the number of hash functions alone (the only way in the past) to improve accuracy. Very interested to know how the team thinks about the theoretical improvement from Ting 2018. New to probabilistic data structures but amazed at how they can be used to do efficient computation in metagenomics.

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant