Algorithmically dense data structure for Corpus #2

Eh2406 · 2023-09-29T03:33:53Z

From a quick perusal of this code, it consists of a series of checks performed on a corpus. The checks look for similar names in the corpus, which is implemented as a Map (either hash or btree). This all looks like fuzzy queries over a data set, which is a well-studied problem.
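For reference, here is a minimal sketch of what a fuzzy query over such a corpus looks like when done by brute force. It is an illustration only, not the code from this repository: it assumes the corpus is a `BTreeSet<String>` of names and that "similar" means Levenshtein distance, and `similar_names` is a hypothetical helper name.

```rust
use std::collections::BTreeSet;

/// Levenshtein edit distance between two strings, counted in `char`s.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the prefix of `a` seen so far and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        curr.push(i + 1);
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Hypothetical helper: every corpus name within `max_distance` edits of `query`.
/// This scans the whole corpus, so each query costs O(corpus size).
fn similar_names<'a>(
    corpus: &'a BTreeSet<String>,
    query: &str,
    max_distance: usize,
) -> Vec<&'a str> {
    corpus
        .iter()
        .map(String::as_str)
        .filter(|name| levenshtein(name, query) <= max_distance)
        .collect()
}
```

Calling `similar_names(&corpus, "serde", 2)` returns the matching names in the set's sorted order. The full scan per query is the cost that a denser, automaton-friendly structure is meant to avoid.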

The fst crate, as described in the excellent blog post "Index 1,600,000,000 Keys with Automata and Rust", allows storing the database in a very dense fashion while still supporting fuzzy queries. Levenshtein is implemented in the crate, and it also supports defining your own similarity metrics. fst really shines with extremely large data sets. I recently put all crate names in an fst and it was <2MB. I should still have that script around if you would like me to retrieve a more accurate number.

In situations where fst is too heavyweight for the number of items being searched, there are other data structures that are efficient for similarity matching. I have heard of the fuzzy-search crate, but don't know how production-ready it is.
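As one illustration of a lighter-weight structure for similarity matching, here is a rough sketch of a BK-tree. This is my own choice of example: it is not necessarily what the fuzzy-search crate uses, and the `BkTree` type and its methods are hypothetical names. Each child sits at a fixed Levenshtein distance from its parent, so the triangle inequality lets a query skip subtrees that cannot contain a match.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Levenshtein edit distance, counted in `char`s (single-row dynamic programming).
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut diagonal = row[0]; // value from the previous row, one column to the left
        row[0] = i + 1;
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = diagonal + usize::from(ca != *cb);
            diagonal = row[j + 1];
            row[j + 1] = substitution.min(diagonal + 1).min(row[j] + 1);
        }
    }
    row[b_chars.len()]
}

/// Hypothetical BK-tree node: children are keyed by their distance to `word`.
struct BkTree {
    word: String,
    children: BTreeMap<usize, BkTree>,
}

impl BkTree {
    fn new(word: &str) -> Self {
        BkTree { word: word.to_string(), children: BTreeMap::new() }
    }

    /// Insert `word`, descending along the edge labelled with its distance to each node.
    fn insert(&mut self, word: &str) {
        let d = levenshtein(&self.word, word);
        if d == 0 {
            return; // already present
        }
        match self.children.entry(d) {
            Entry::Occupied(mut child) => child.get_mut().insert(word),
            Entry::Vacant(slot) => {
                slot.insert(BkTree::new(word));
            }
        }
    }

    /// Collect every stored word within `max` edits of `query` into `out`.
    fn search<'a>(&'a self, query: &str, max: usize, out: &mut Vec<&'a str>) {
        let d = levenshtein(&self.word, query);
        if d <= max {
            out.push(self.word.as_str());
        }
        // Triangle inequality: matches can only live under edges in [d - max, d + max].
        for (_, child) in self.children.range(d.saturating_sub(max)..=d + max) {
            child.search(query, max, out);
        }
    }
}
```

To use it, build the tree with `BkTree::new` on the first name, `insert` the rest, then call `search` with a query, a maximum distance, and an output vector. Results come back in traversal order, so sort them if a stable order matters.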
