golang: speedup using TRIE #353
Conversation
Thanks @starius. @javierron are you able to code-review this one?
@monperrus @starius
Very smart use of Replacer!
The main issue I see is that the list of patterns will need to be updated as more patterns are added to the file, which is not ideal.
Maybe the solution is to determine at initialization whether a pattern is literal or regex. I believe this would simplify analyzePattern, and not have too much impact on performance, since most of the patterns are literals anyway.
Other than that, I leave only some general comments.
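The literal-vs-regex classification suggested above could be sketched with regexp.QuoteMeta: a pattern is a plain literal exactly when quoting its metacharacters changes nothing. This is a hypothetical helper illustrating the idea, not code from the PR:

```go
package main

import (
	"fmt"
	"regexp"
)

// isLiteral reports whether pattern contains no regexp metacharacters,
// i.e. it could be matched with plain substring search.
// (Hypothetical helper sketching the reviewer's suggestion.)
func isLiteral(pattern string) bool {
	return regexp.QuoteMeta(pattern) == pattern
}

func main() {
	fmt.Println(isLiteral("Googlebot"))       // true
	fmt.Println(isLiteral("Google(bot|web)")) // false
}
```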
validate.go (Outdated)
@@ -80,31 +84,198 @@ var Crawlers = func() []Crawler {
	return crawlers
}()

var regexps = func() []*regexp.Regexp {
	regexps := make([]*regexp.Regexp, len(Crawlers))
var pattern2literals = map[string][]string{
This works well, but requires maintenance as more patterns are added to the file.
validate.go (Outdated)
	return []string{prefix}, nil
}

mainLiternal, has := pattern2mainLiteral[pattern]
Typo: mainLiternal -> mainLiteral
Fixed!
	regexps []regexpPattern
}

var uniqueToken = hex.EncodeToString((&maphash.Hash{}).Sum(nil))
We can do the same with a short, deterministic string.
A short string may match spontaneously. A deterministic string may match intentionally: someone attacking the detector can pass a text with that string followed by wrong format in it, causing a panic in the current implementation.
for {
	uniquePos := strings.Index(replaced, uniqueToken)
	if uniquePos == -1 {
We can move this check to the for declaration, it's a bit cleaner
Can you elaborate on this, please?

	for uniquePos := strings.Index(replaced, uniqueToken); uniquePos != -1; uniquePos = strings.Index(replaced, uniqueToken) {
	}

Like this? I don't really like it, because it copy-pastes the strings.Index call.
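For illustration, the "for { ... break }" shape under discussion scans every occurrence with a single strings.Index call per iteration. A small standalone sketch (made-up input, not the PR's code):

```go
package main

import (
	"fmt"
	"strings"
)

// countOccurrences scans s with the "for { ... break }" loop shape:
// one strings.Index call per iteration, exiting when nothing is found.
func countOccurrences(s, token string) int {
	count := 0
	for {
		i := strings.Index(s, token)
		if i == -1 {
			break
		}
		count++
		s = s[i+len(token):]
	}
	return count
}

func main() {
	fmt.Println(countOccurrences("a#b#c#", "#")) // 3
}
```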
for {
	uniquePos := strings.Index(replaced, uniqueToken)
	if uniquePos == -1 {
		break
We can move this check to the for declaration, it's a bit cleaner
See my comment above.
}

	return false
}

// Finds all crawlers matching the User Agent and returns the list of their indices in Crawlers.
func MatchingCrawlers(userAgent string) []int {
Instead of crashing, we can return error to allow the client to handle it.
See my comment above.
}
oldnew = append(oldnew, oldnew2...)

regexps2 := make([]regexpPattern, len(regexps))
Why is this necessary?
Save memory. regexps is allocated in append calls and is likely to have extra capacity, because append reallocates with reserve capacity. regexps2 is allocated with exactly the size that is needed. We know the size only after finishing iterating over the patterns, so we cannot preallocate a slice of the exact size in advance.

I added a comment in the code.
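The reserve-capacity behaviour is easy to demonstrate in isolation. A standalone sketch with ints standing in for regexpPattern:

```go
package main

import "fmt"

func main() {
	// Growing element by element via append leaves spare capacity
	// behind (the gc runtime doubles small slices: 1, 2, 4, ... 128).
	var grown []int
	for i := 0; i < 100; i++ {
		grown = append(grown, i)
	}
	// Copying into an exactly sized slice releases the reserve.
	exact := make([]int, len(grown))
	copy(exact, grown)
	fmt.Println(cap(grown) > 100, cap(exact)) // true 100
}
```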
validate.go (Outdated)
})
}

replaceWith := fmt.Sprintf(" %s%c%05d ", uniqueToken, label, num)
The size of the index (5) should be configurable somehow, to avoid the magic 5 later.
Fixed. Moved to const numLen.
validate.go (Outdated)
}

start := uniquePos + len(uniqueToken) + 1
if start+5 >= len(replaced) {
No need to recompute len(uniqueToken) every call
Fixed. Moved to a const uniqueTokenLen.
validate.go (Outdated)
	break
}

start := uniquePos + len(uniqueToken) + 1
No need to recompute len(uniqueToken) every call
Fixed. Moved to a const uniqueTokenLen.
Hi @javierron ! Thank you very much for the review!

Can you elaborate on this, please?

	prefix, complete := re.LiteralPrefix()
	if complete {
		return []string{prefix}, nil
	}

(I put this code after checking presence in

We still need both tables

Do you have ideas how to do it in an elegant way?
@starius thanks for the updates. @javierron let me know how we should proceed, thanks.
Hi @starius

Thanks for the response. I mean, we could have two sets: one set for literals, which are checked against using the trie solution; and another set for actual patterns, which are checked against by using a regex chain (regex1|regex2|...|regexn). This would avoid the issue of having to update the program every time a new crawler with a pattern is added.

@monperrus Does that make any sense?
Hi @javierron @monperrus ! I think it is possible to implement
@starius would we be able to finish this work? That would be great!
@monperrus Yes. Sorry for the delay 🙏
Most patterns are simple search strings (no special Regexp symbols). Some utilize ^ and $, which can be emulated in plaintext search by appending these characters to the text itself, so they match as regular characters. Additionally, some patterns involve (xx|yy) or [xY] structures, which expand to several plaintexts. Rare patterns require real regexp matching.

I've applied these simplifications and modifications. They are detected automatically by the function analyzePattern, which returns the list of plain-text patterns (all possible search strings) or, in complex cases, the main literal and a regexp. The main literal is needed to know when to run the regexp.

Search strings are substituted with a random hex string of length 16 (to prevent spontaneous or intentional matching with anything), followed by a label ("-" for simple search strings, "*" for rare cases requiring regexp) and a number encoded in "%05d" format.

All replacements are performed using strings.Replacer, which utilizes a TRIE and is therefore very fast. The random hex string is then searched for in the output of the replacement. If it's not found, it indicates a mismatch. If found, it's either a match (for simple search-string labels) or a potential match (for regexp patterns). In the latter case, the corresponding regexp is executed on the text to verify the match.
Benchmark comparison:

	$ benchstat old.txt new.txt
	goos: linux
	goarch: amd64
	pkg: github.com/monperrus/crawler-user-agents
	cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
	                             │  old.stat   │             new.stat              │
	                             │   sec/op    │   sec/op    vs base               │
	IsCrawlerPositive-2           70.453µ ± 1%   1.508µ ± 1%  -97.86% (p=0.000 n=20)
	MatchingCrawlersPositive-2    77.343µ ± 1%   1.585µ ± 1%  -97.95% (p=0.000 n=20)
	IsCrawlerNegative-2           75.237µ ± 0%   1.725µ ± 1%  -97.71% (p=0.000 n=20)
	MatchingCrawlersNegative-2    75.884µ ± 1%   1.725µ ± 0%  -97.73% (p=0.000 n=20)
	geomean                        74.68µ        1.633µ       -97.81%

	                             │  old.stat   │              new.stat               │
	                             │     B/s     │     B/s      vs base                │
	IsCrawlerPositive-2           2.141Mi ± 1%   99.955Mi ± 1%  +4568.60% (p=0.000 n=20)
	MatchingCrawlersPositive-2    1.950Mi ± 1%   95.067Mi ± 1%  +4774.57% (p=0.000 n=20)
	IsCrawlerNegative-2           1.936Mi ± 0%   84.586Mi ± 1%  +4269.21% (p=0.000 n=20)
	MatchingCrawlersNegative-2    1.926Mi ± 1%   84.615Mi ± 0%  +4292.33% (p=0.000 n=20)
	geomean                       1.987Mi        90.81Mi        +4471.46%

New implementation is 40 times faster!
@monperrus @javierron @srstsavage Please check the latest version of the PR!
Thanks a lot @starius.