Most patterns are simple search strings with no special regexp symbols. Some use ^ and $, which can be emulated in a plain-text search by adding these characters to the ends of the text itself, so that they match as regular characters. Some patterns also contain (xx|yy) or [xY] structures, which expand to several plain-text strings. Only rare patterns require real regexp matching.

These simplifications are detected automatically by the function analyzePattern, which returns either the list of plain-text patterns (all possible search strings) or, in complex cases, the main literal and a regexp. The main literal is needed to know when to run the regexp.

Search strings are replaced with a random hex string of length 16 (to prevent accidental or intentional matches with anything in the input), followed by a label: "-" for simple search strings or "*" for the rare cases requiring a regexp, then a number in "%05d" format. All replacements are performed with strings.Replacer, which uses a trie and is therefore very fast.

The random hex string is then searched for in the output of the replacement. If it is not found, the text does not match. If it is found, it is either a match (for simple search string labels) or a potential match (for regexp patterns). In the latter case, the corresponding regexp is run on the text to verify the match.
Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                             │   old.stat   │              new.stat               │
                             │    sec/op    │    sec/op     vs base               │
IsCrawlerPositive-2            70.453µ ± 1%    1.508µ ± 1%  -97.86% (p=0.000 n=20)
MatchingCrawlersPositive-2     77.343µ ± 1%    1.585µ ± 1%  -97.95% (p=0.000 n=20)
IsCrawlerNegative-2            75.237µ ± 0%    1.725µ ± 1%  -97.71% (p=0.000 n=20)
MatchingCrawlersNegative-2     75.884µ ± 1%    1.725µ ± 0%  -97.73% (p=0.000 n=20)
geomean                         74.68µ         1.633µ       -97.81%

                             │   old.stat   │               new.stat                 │
                             │     B/s      │      B/s       vs base                 │
IsCrawlerPositive-2            2.141Mi ± 1%   99.955Mi ± 1%  +4568.60% (p=0.000 n=20)
MatchingCrawlersPositive-2     1.950Mi ± 1%   95.067Mi ± 1%  +4774.57% (p=0.000 n=20)
IsCrawlerNegative-2            1.936Mi ± 0%   84.586Mi ± 1%  +4269.21% (p=0.000 n=20)
MatchingCrawlersNegative-2     1.926Mi ± 1%   84.615Mi ± 0%  +4292.33% (p=0.000 n=20)
geomean                        1.987Mi         90.81Mi       +4471.46%

The new implementation is more than 40 times faster (about 46x by geomean)!