Add Matching Whole Word to the Emit class #98

Open
RMSD opened this issue Jul 27, 2022 · 1 comment


RMSD commented Jul 27, 2022

Currently, Emit returns only the keyword and its start and end indexes. I'd like to also get the whole word that a match is part of.

For example:

List<String> triggerList = Lists.list("cd");
Trie trie = Trie.builder()
		.addKeywords(triggerList)
		.build();

Collection<Emit> result = trie.parseText("abcdxyz abz cdefg");

// In an ideal world this would work. Currently you have to search for whitespace or the
// end of the string yourself, starting from the given start and end indexes, in a custom handler.
result.stream()
        .map(Emit::matchingWord) // proposed method, does not exist yet
        .toList();

// Desired result: ["abcdxyz", "cdefg"]

RMSD commented Jul 27, 2022

I have it really crudely implemented via emitHandler here:

        prefixTrie.parseText(text, emit -> {
            // All of my trie keywords are prefixes, so I only accept matches that begin at the start of a word.
            // Otherwise you would have to search backwards from emit.getStart() until you hit index 0 or a space.
            if (emit.getStart() != 0 && text.charAt(emit.getStart() - 1) != ' ') {
                return false;
            }

            // Because this library doesn't offer a matchPartOfWord method, there is a degenerate case where
            // several codes each scan to the end of the text.
            // E.g. codes 11222222 and 1222222 with text 11222222 both scan nearly the full text, so the entire
            // method is bounded by O(n^2), if I recall my time complexity correctly.
            var wordStart = emit.getStart();
            var wordEnd = text.indexOf(" ", emit.getEnd());

            if (wordEnd < 0) {
                words.add(text.substring(wordStart));
            } else {
                words.add(text.substring(wordStart, wordEnd));
            }
            return true;
        });

Honestly, I'm really hoping for a better solution than my brute-force one.
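
For keywords that are not guaranteed to be prefixes, the backward and forward search described above could be sketched roughly as follows. This is only an illustration, not part of the library: `WordExpander`, `wordStart`, `wordEnd`, and `wordsFor` are hypothetical names, matches are passed as plain `{start, end}` pairs rather than `Emit` objects, the end index is assumed to be inclusive (worth checking against `Emit.getEnd()`), and matches are assumed to arrive in non-decreasing order of end index and to contain no whitespace. Skipping matches that fall inside the most recently collected word is one possible way to avoid rescanning the same word for every overlapping keyword.

```java
import java.util.ArrayList;
import java.util.List;

final class WordExpander {
    private WordExpander() {}

    // Walk left from a match start until the previous character is whitespace or index 0 is reached.
    static int wordStart(String text, int start) {
        int i = start;
        while (i > 0 && !Character.isWhitespace(text.charAt(i - 1))) {
            i--;
        }
        return i;
    }

    // Walk right from an inclusive match end; returns the exclusive end of the enclosing word.
    static int wordEnd(String text, int endInclusive) {
        int i = endInclusive + 1;
        while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    // Assumption: spans are {start, endInclusive} pairs sorted by non-decreasing end index.
    static List<String> wordsFor(String text, int[][] spans) {
        List<String> words = new ArrayList<>();
        int lastWordEnd = 0; // exclusive end of the most recently collected word
        for (int[] span : spans) {
            if (span[1] < lastWordEnd) {
                continue; // match lies inside a word that was already collected
            }
            int from = wordStart(text, span[0]);
            int to = wordEnd(text, span[1]);
            words.add(text.substring(from, to));
            lastWordEnd = to;
        }
        return words;
    }
}
```

With the text from the first comment, `WordExpander.wordsFor("abcdxyz abz cdefg", new int[][] {{2, 3}, {12, 13}})` yields `[abcdxyz, cdefg]`. Inside an emit handler the pairs would come from `emit.getStart()` and `emit.getEnd()`. Because each word is scanned once per collected word rather than once per match, overlapping codes such as 11222222 and 1222222 on the text 11222222 produce a single entry.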
