Dear khmer team,
I'm amazed at how much effort has gone into such a widely used project/software. As mentioned in the first version of the paper (and in many other places), Count-Min has an estimation variance in the form of one-sided overestimation, and reducing that variance by using a larger number of hash functions increases both computational cost and space/memory (O(m*p + p), where m is the number of counters per row and p is the number of hash functions).

In my real-world problem I need the frequency of each k-mer in a metagenomic sample, and the more accurate the better. The maximum likelihood estimator (MLE) and the Bayesian estimator (BE, much slower) described in the KDD paper by Daniel Ting (2018) (https://dl.acm.org/doi/abs/10.1145/3219819.3219975) appear to be more accurate than the default Count-Min estimator in many experiments. I imagine the BE could be very expensive, but it seems to be the theoretical optimum for this kind of non-parametric estimation. The MLE seems more practical, although the error distribution of each row in the hash table still needs to be computed, which adds computational cost. The good news is that we no longer need to rely on increasing the number of hash functions (previously the only option) to improve accuracy.

I'd be very interested to hear how the team thinks about the theoretical improvements from Ting 2018. I'm new to probabilistic data structures but amazed at how they can be used to do efficient computation in metagenomics.
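For context, here is a minimal pure-Python sketch of a Count-Min table with the classic min estimator. This is only my own illustration (the class, parameter names, and hashing scheme are assumptions, not khmer's API); it shows where the one-sided overestimation comes from and the point in `query()` where an MLE/Bayesian estimator in the spirit of Ting 2018 would plug in.

```python
# Minimal Count-Min sketch -- illustrative only, not khmer's implementation.
import random


class CountMinSketch:
    def __init__(self, width, depth, seed=42):
        self.width = width          # m: counters per row
        self.depth = depth          # p: number of hash functions / rows
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        # One independent hash seed per row (simple salted-hash scheme).
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, row, item):
        # Salted hash of the item for the given row.
        return hash((self.seeds[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Classic estimator: take the minimum over rows.
        # Each row can only overestimate (collisions add counts), so the
        # error is one-sided; the MLE/BE of Ting (2018) would replace this
        # min with an estimate that models the collision noise per row.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(width=1000, depth=4)
    for kmer in ["ACGT", "ACGT", "ACGT", "TTGA"]:
        cms.add(kmer)
    print(cms.query("ACGT"))   # >= 3, never less (one-sided error)
    print(cms.query("TTGA"))   # >= 1
```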
Thanks,
Jianshu