golang: speedup using TRIE
Most patterns are simple search strings with no special regexp symbols.
Some use ^ and $, which can be emulated in a plain-text search by adding these
characters to the text itself (^ at the start, $ at the end), so that they match
as regular characters. Additionally, some patterns contain (xx|yy) or [xY]
structures, which expand to several plain-text strings. A few rare patterns
require real regexp matching.
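
The anchor emulation described above can be sketched as follows. This is a
minimal illustration, not the code from this commit: the helper name
containsAnchored and the sample user-agent strings are assumptions made for
the example.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAnchored (hypothetical helper) reports whether a plain-text pattern
// occurs in text, where a leading '^' or trailing '$' in the pattern acts as
// an anchor. The text is wrapped in literal '^' and '$' so the anchors in the
// pattern can be matched as ordinary characters.
// Limitation of this sketch: a text that itself contains '^' or '$' could
// produce a false match.
func containsAnchored(pattern, text string) bool {
	return strings.Contains("^"+text+"$", pattern)
}

func main() {
	fmt.Println(containsAnchored("^Googlebot", "Googlebot/2.1"))     // true
	fmt.Println(containsAnchored("^Googlebot", "Mozilla Googlebot")) // false
	fmt.Println(containsAnchored("bot$", "some bot"))                // true
}
```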

I've applied these simplifications and modifications. They are detected
automatically by the function analyzePattern, which returns either the list of
plain-text patterns (all possible search strings) or, in complex cases, the main
literal and a regexp. The main literal is needed to know when to run the regexp.

Each search string is replaced with a random hex string of length 16 (to prevent
accidental or intentional matches with anything in the input), followed by a
label: "-" for simple search strings or "*" for the rare cases requiring a
regexp, and then a number formatted as "%05d".

All replacements are performed using strings.Replacer, which uses a trie and
is therefore very fast. The random hex string is then searched for in the output
of the replacement. If it is not found, there is no match. If it is found, it is
either a match (for simple search string labels) or a potential match (for
regexp patterns). In the latter case, the corresponding regexp is executed on
the text to verify the match.

Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                           │   old.stat   │              new.stat               │
                           │    sec/op    │   sec/op     vs base                │
IsCrawlerPositive-2          70.453µ ± 1%   1.508µ ± 1%  -97.86% (p=0.000 n=20)
MatchingCrawlersPositive-2   77.343µ ± 1%   1.585µ ± 1%  -97.95% (p=0.000 n=20)
IsCrawlerNegative-2          75.237µ ± 0%   1.725µ ± 1%  -97.71% (p=0.000 n=20)
MatchingCrawlersNegative-2   75.884µ ± 1%   1.725µ ± 0%  -97.73% (p=0.000 n=20)
geomean                       74.68µ        1.633µ       -97.81%

                           │   old.stat   │                new.stat                 │
                           │     B/s      │      B/s       vs base                  │
IsCrawlerPositive-2          2.141Mi ± 1%   99.955Mi ± 1%  +4568.60% (p=0.000 n=20)
MatchingCrawlersPositive-2   1.950Mi ± 1%   95.067Mi ± 1%  +4774.57% (p=0.000 n=20)
IsCrawlerNegative-2          1.936Mi ± 0%   84.586Mi ± 1%  +4269.21% (p=0.000 n=20)
MatchingCrawlersNegative-2   1.926Mi ± 1%   84.615Mi ± 0%  +4292.33% (p=0.000 n=20)
geomean                      1.987Mi         90.81Mi       +4471.46%

The new implementation is about 45 times faster (geomean)!
starius committed Oct 14, 2024
1 parent 5e09d47 commit 5246b81
Showing 2 changed files with 896 additions and 9 deletions.