Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algorithmically dense data structure for Corpus #2

Open
Eh2406 opened this issue Sep 29, 2023 · 0 comments
Open

Algorithmically dense data structure for Corpus #2

Eh2406 opened this issue Sep 29, 2023 · 0 comments

Comments

@Eh2406
Copy link

Eh2406 commented Sep 29, 2023

From a quick perusal of this code it consists of a series of checks to be performed on a corpus. The checks are fundamentally looking for similar names in the corpus, and the corpus is implemented as a Map (either hash or btree). Fundamentally this all looks like fuzzy queries over a data set, which is a well studied problem.

The fst as described in the excellent blog post Index 1,600,000,000 Keys with Automata and Rust allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenstein is implemented in the crate, but it also supports defining your own similarity metrics. Fst really shines with extremely large data sets. I recently put all crate names in Fst and it was <2MB. I should still have that script around if you would like me to retrieve a more accurate number.

In situations where Fst is heavyweight for the number of items being searched there are other data structures that are efficient for doing similarity matching. I have heard of the fuzzy-search crate, but don't know how production ready it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant