Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HashChain search #29

Open
nishihatapalmer opened this issue Jun 15, 2022 · 1 comment
Open

Implement HashChain search #29

nishihatapalmer opened this issue Jun 15, 2022 · 1 comment

Comments

@nishihatapalmer
Copy link
Owner

nishihatapalmer commented Jun 15, 2022

HashChain is pretty much the fastest search for longer patterns. Implement it for byteseek.

It works well for low alphabets and high alphabets, and is tunable for different memory requirements. It's a factor search, most similar to WFR in that it uses a rolling hash construction. It creates chains of hashes in the hash table with a fingerprint that allows verification that the next hash is correct before doing another index lookup.

Need to figure out where its worst cases get triggered. Search time rises dramatically (like QF, and WFR). In practice it seems quite hard to trigger if the hash table is sufficiently empty or we aren't dealing with pathological low alphabet or repetitive data. It still does better than WFR or QF in these situations though.

Nevertheless, byteseek needs to know if it is appropriate to select this algorithm or whether another slower but without the same extreme worst case behaviour needs to be used instead. Or if the pattern can be modified (use a substring) that would avoid the pathological case? A pattern composed only of zeros might not do well with hash chain.

@nishihatapalmer
Copy link
Owner Author

NB. HashChain is not yet finished or published. I'm still working on fully understanding and optimising the hash table construction, and defining what tuning parameters are required (automate as far as possible, use should only select memory really, although knowledge of low/high alphabet or types of data to search (shannon entropy?) could influence selection of qgram length and bitshift parameters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant