Add Golang package #348
They prevented parsing in Go.
Performance increase is huge!
Go regexp: 0.05 MB/s
go-re2 in pure Go mode: 77.84 MB/s
go-re2 using C++ RE2 (-tags re2_cgo): 213.85 MB/s
To enable C++ RE2, install it with sudo apt-get install libre2-dev and pass the -tags re2_cgo build tag.
RE2 is fast on large regexps (faster than running each regexp individually, including with Go regexp). I used this fact to find the matching regexps using a tree of regexps, each built by concatenating a part of the patterns. The individual regexps are found by descending from the root node of the tree.
Benchmark results (BenchmarkMatchingCrawlers):
Before this commit (RE2 individually, pure Go): 0.32 MB/s
Before this commit (RE2 individually, -tags re2_cgo): 1.32 MB/s
If Go regexp is used individually: 2.31 MB/s
With this commit (RE2, pure Go): 5.90 MB/s
With this commit (RE2, -tags re2_cgo): 18.24 MB/s
Maybe it can be improved even further with Hyperscan, but I don't want to bring in another dependency.
GitHub Actions workflow: https://github.com/starius/crawler-user-agents/actions/workflows/golang.yml
validate_test.go
t.Run(crawler.URL, func(t *testing.T) {
	// print pattern to console for quickcheck in CI
	fmt.Print(crawler.Pattern)
Use fmt.Println to print each pattern on a separate line.
Also, maybe it is better to pass crawler.Pattern as the subtest name (the first argument of t.Run) and run with go test -v; it will print each subtest name (which would be a pattern).
great idea, could you do it?
Done. I pushed to the branch.
Thanks a lot for the great contribution!
I've asked for PR approval by Go experts.
Added an example Go program and fixed a copy-paste error in the Go benchmark.
Thanks a lot @starius, I really appreciate it. We really worry about software supply chain security for crawler-user-agents (cc/ @ericcornelissen @javierron), and we would like to keep external dependencies minimal. In particular, I'd like to remove dependency to If this means moving from What do you think?
See monperrus#348 (comment) Also, it turned out to be faster if regexps are checked individually, not as one large |-concatenation of regexps. One regexp check consumes 66 microseconds on Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.
Thank you for the feedback! I removed I acknowledge the problems with wazero and re2. I just caught a crash in re2, related to wazero! I switched back to the Go standard regexp. It turned out not to be so bad if regexps are checked one by one rather than as one regexp for all patterns. One
Check against the list from https://github.com/microlinkhq/top-user-agents Fix monperrus#350
I pushed another commit to check against false positives. It fixes #350
great, many thanks @starius
Hi @starius Afterthought of @javierron: the way the regex is written, we still need to do n regex matches when matching against the two depth=1 nodes (and then some more). Maybe a trie based join approach would be better? WDYT?
Hi @monperrus ! Using a trie looks good to me! The only trie implementation in the Go standard library I am aware of is https://pkg.go.dev/strings#NewReplacer The problem is that some regexps are not just search strings, but actually use regexp syntax, e.g.
@monperrus See #353
The Go package embeds the JSON file with patterns using Go's go:embed feature, so it stays in sync with the JSON file automatically. No manual updates of the Go package are needed.
The JSON file is parsed when the Go package is loaded and is exposed in the API as a Go list of type Crawler. The functions IsCrawler and MatchingCrawlers check whether a User Agent belongs to a crawler. They use the go-re2 library to run regexps, which is faster than the standard library regexp engine. To speed up MatchingCrawlers, I combine the regexps into a binary tree and use it when searching. Since RE2 works faster on one large regexp than on each regexp individually, this brings a speed-up.
I also provided a GitHub workflow to run the tests and benchmarks of the Go package on each push.
To achieve the best possible performance in functions IsCrawler and MatchingCrawlers, install C++ RE2 on your system:
sudo apt-get install libre2-dev
and pass the build tag:
-tags re2_cgo