
Add benchmarking #163

Merged

merged 1 commit into ua-parser:master from the benchmarks branch on Feb 11, 2024

Conversation

masklinn
Contributor

@masklinn masklinn commented May 5, 2023

Benchmarking seems desirable to test things like #149 or #143.

However, while uap-cpp added a benchmark of sorts in ua-parser/uap-cpp#27, I'm not sure it's a very useful one: the useragents.txt file is a set of real-world UA strings (I assume from DailyMotion's logs), but it's a set, not a sequence: 793 unique user agent strings.

While it is "real world" and can give a spread of real-world spot performance for specific parsers, it doesn't allow testing things like caches (whether implementation or sizing), as a cache can't come into play with such a dataset unless it is sized large enough to hold the entire dataset.

I don't know if @DailyMats is still active and would be able to provide something like a day's or week's worth of user agents to bench on. I also sent a message in a bottle to @getsentry, but that doesn't seem any more likely to be noticed. And the internets don't really seem to have publicly available real-world datasets.
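
As a rough illustration of the point about caches (the parser here is a placeholder, not anything from the package), hit rate can only be measured meaningfully over a *sequence*; the stdlib's `functools.lru_cache` makes this easy to see:

```python
from functools import lru_cache

def measure_hit_rate(user_agents, cache_size=128):
    @lru_cache(maxsize=cache_size)
    def parse_ua(ua: str) -> str:
        return ua  # placeholder for a real parse

    for ua in user_agents:
        parse_ua(ua)
    info = parse_ua.cache_info()
    return info.hits / (info.hits + info.misses)

# On a real-world sequence (with repeats) the hit rate is meaningful;
# on the deduplicated set every lookup is a miss unless the cache
# holds the entire dataset.
sequence = ["ua-a", "ua-b", "ua-a", "ua-c", "ua-a", "ua-b"]
print(measure_hit_rate(sequence))        # 0.5: repeats hit the cache
print(measure_hit_rate(set(sequence)))   # 0.0: no repeats, no hits
```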

Additional items:

  • a script (e.g. an `-m` command) to test various base parsers, caches, and cache sizes on user-provided samples
  • support for threaded benches in order to compare the overhead of different cache concurrency-safety mechanisms (big lock, thread-local, optimistic, futures, ...), likely more important post-3.13 for GIL-less support

@DailyMats

Here is a sample of recent user agent strings, without deduplication:

useragents_2023-04-26.txt.gz

I unfortunately can't provide you with more details about it than that, but it should give you a more realistic (and more up-to-date) dataset to work with.

@masklinn
Contributor Author

masklinn commented May 5, 2023

> Here is a sample of recent user agent strings, without deduplication:
>
> useragents_2023-04-26.txt.gz
>
> I unfortunately can't provide you with more details about it than that, but it should give you a more realistic (and more up-to-date) dataset to work with.

Oh wow, awesome! That's pretty much what I've been looking for.

masklinn added a commit to masklinn/uap-python that referenced this pull request Oct 25, 2023
New API with full typing
masklinn added a commit to masklinn/uap-python that referenced this pull request Nov 2, 2023
Requires splitting out some of the testenvs, as re2 is not available
for PyPy at all, and not yet for 3.12.

This only uses `re2.Set`, which turns out not to be great, at least
according to `pytest --durations` on 3.11:

- re2 is sometimes faster for UA tests
  - `pgts_browser_list.yaml` goes from 2.5s to 1.5s
  - `firefox_user_agent_strings.yaml` goes from 0.05 to 0.04 (not
    really significant)
  - though `test_ua.yaml` goes from 0.18 to 0.65
- re2 is *way* slower for devices tests
  - `test_device.yaml` goes from 2.5 to 8s

Obviously the tests might not be representative at all; implementing a
proper benchmark on a real-life test set (ua-parser#163) would likely
provide better information.

It's possible that `FilteredRE2` would offer better performance, *but*
it requires additional memory and, more importantly, a fast literal
string matcher, e.g. a fast implementation of Aho-Corasick, or
possibly Hyperscan's Teddy (via [python-hyperscan][5]?). [According to
burntsushi, Commentz-Walter is not great in practice][1], at least as
you increase the number of patterns, so that one looks like a dead end.

Either way this would likely be an *additional* dependency to make it
usable, although there seems to be [a well-maintained Python version
with impressive performance (for pure Python)][2], [a native
module][3], and [a wrapper for burntsushi's Rust implementation][4]
which claims even better performance than the native module.

Linked to (but probably can't be argued to fix) ua-parser#149.

[1]: https://news.ycombinator.com/item?id=26913349
[2]: https://github.com/abusix/ahocorapy
[3]: https://github.com/WojciechMula/pyahocorasick/
[4]: https://github.com/G-Research/ahocorasick_rs/
[5]: https://python-hyperscan.readthedocs.io
masklinn added a commit to masklinn/uap-python that referenced this pull request Nov 3, 2023
New API with full typing
masklinn added a commit to masklinn/uap-python that referenced this pull request Jan 14, 2024
New API with full typing
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
New API with full typing
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
New API with full typing
@masklinn masklinn mentioned this pull request Feb 4, 2024
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 6, 2024
New API with full typing
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
masklinn added a commit that referenced this pull request Feb 6, 2024
New API with full typing
========================

Seems pretty self-explanatory, rather than returning somewhat ad-hoc
dicts this API works off of dataclasses, it should be compatible with
the legacy version through the magic of ~~buying two of them~~
`dataclasses.asdict`.

Parser API
==========

The legacy version had "parsers" which really represented individual
parsing rules. In the new API, a parser does what the top-level
functions did: it wraps the entire job of parsing a user-agent string.

The core API is just `__call__`, with a selection flag for the domains
("domain" seems like the least bad term for what "user agent", "os",
and "device" are; other alternatives I considered are "component" and
"category", but I'm still ambivalent). Overridable helpers are
provided which match the old API's methods (with PEP 8 conventions),
as well as the same style of helpers at the package toplevel.
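
A rough sketch of that shape (names like `Domain` and `Parser` are illustrative here, not necessarily the package's final identifiers):

```python
import enum

class Domain(enum.Flag):
    USER_AGENT = 1
    OS = 2
    DEVICE = 4
    ALL = USER_AGENT | OS | DEVICE

class Parser:
    def __call__(self, ua: str, domains: Domain = Domain.ALL):
        """Resolve only the requested domains of the UA string."""
        raise NotImplementedError

    # overridable helper in the style of the old API's methods (PEP 8 names)
    def parse_user_agent(self, ua: str):
        return self(ua, Domain.USER_AGENT)
```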

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to
test that), the ability to instantiate parsers should provide the
opportunity for things like thread-local parsers, or actual
parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently,
including e.g. multiple independent custom yaml sets. Not sure there's
a use for it, but why not?

At the very least it should make using custom YAML datasets much
easier than having to set envvars.

The caching parser being stateful, protecting it with an optional
lock seems like the best way to make caching thread-safe. When only
using a single thread, or when using thread-local parsers, the
locking can be disabled by using a `contextlib.nullcontext` as the lock.
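
A minimal sketch of the lock injection (hypothetical `Cache` holder, not the actual classes): since the lock only needs to be a context manager, a real `threading.Lock` and a no-op `nullcontext` are interchangeable:

```python
import threading
from contextlib import nullcontext

class Cache:
    """Hypothetical cache state; any context manager can serve as the lock."""
    def __init__(self, lock=None):
        self._entries = {}
        self._lock = lock if lock is not None else threading.Lock()

    def get(self, key):
        with self._lock:  # a nullcontext() makes this a no-op
            return self._entries.get(key)

    def put(self, key, value):
        with self._lock:
            self._entries[key] = value

shared = Cache()                    # guarded by a real threading.Lock
local = Cache(lock=nullcontext())   # single-thread / thread-local: no lock overhead
```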

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to
set the global parser. Hopefully this makes evaluating proposed
parsers as well as evaluating & tuning caches (algorithm & size)
easier. Even more so as we should provide some sort of evaluation CLI
in #163.

Caches
------

In the old API, the package-provided cache could only be global and
had a single implementation, as it had to integrate with the toplevel
parsing functions. By reifying the parsing job, a cache is just a
parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative
cache strategies.
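
A sketch of the "cache is just a parser" delegation (illustrative names, not the package's actual classes):

```python
from collections import OrderedDict

class CachingParser:
    def __init__(self, parser, maxsize=200):
        self.parser = parser        # any callable taking a UA string
        self.maxsize = maxsize
        self.cache = OrderedDict()  # LRU: most recently used at the end

    def __call__(self, ua: str):
        try:
            self.cache.move_to_end(ua)
            return self.cache[ua]   # hit
        except KeyError:
            pass
        result = self.parser(ua)    # miss: delegate to the wrapped parser
        self.cache[ua] = result
        if len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)  # evict least recently used
        return result

# wraps any base parser, so cache strategy and parsing strategy vary independently
# cached = CachingParser(base_parser, maxsize=1000)
```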

Bulk APIs
---------

The current parser checks rules (regexes) one at a time on the input,
but there are advanced regex APIs which can check a regex *set* and
return which one(s) matched, allowing much more efficient bulk
matching, e.g. Google's re2 or Rust's regex crate.

With the old scheme, this would have been a pretty significant change
in use / behaviour, making the legacy "parsers" unusable with no
recourse. Under the new parsing scheme, these can just be different
"base" parsers: they can be the default, they can be cached, and
users can instantiate their own parser instead.
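
For illustration only, the set idea can be approximated with the stdlib `re` module by or-ing the rules into one alternation of named groups; `re2.Set` and Rust's `RegexSet` provide the real thing (all matching rules, in one scan, without backtracking):

```python
import re

# one named group per rule, so a single scan tells us which rule matched
# instead of trying the rules one at a time
rules = {
    "firefox": r"Firefox/\d+",
    "chrome": r"Chrome/\d+",
    "safari": r"Version/[\d.]+ Safari/",
}
combined = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in rules.items()))

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
m = combined.search(ua)
print(m.lastgroup if m else None)  # "firefox", found in a single pass
```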

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though
that requires excluding that bit from the tests as there are
apparently broken test cases around that
item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is an opportunity to allow setting parsers at
runtime more easily (instead of via load-time envvars); however,
optional constructors (classmethods) turn out to be iffy from both an
API and a typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of
the UAs) just take a uniform parsed data set, and have utility loaders
provide that from various data sources (precompiled, preformatted, or
data files). This avoids redundancy and the need for mixins /
inheritance, and mypy is *much* happier.
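
An illustrative sketch of the loader / base-parser split (all names here are hypothetical, not the package's actual helpers):

```python
from typing import Any, List, Tuple

# a uniform rules structure: a tuple of lists of matchers (element type
# left as Any, the matcher classes are out of scope for this sketch)
Matchers = Tuple[List[Any], List[Any], List[Any]]

def load_yaml_rules(path: str) -> Matchers:
    """Hypothetical loader: build Matchers from a regexes.yaml file."""
    raise NotImplementedError

def load_precompiled_rules() -> Matchers:
    """Hypothetical loader: return Matchers baked in at package build time."""
    raise NotImplementedError

class BasicParser:
    """A "base" parser only ever consumes the uniform Matchers structure."""
    def __init__(self, matchers: Matchers) -> None:
        self.ua_matchers, self.os_matchers, self.device_matchers = matchers

# the same parser class works with any rule source, no mixins or classmethods:
# parser = BasicParser(load_yaml_rules("custom-regexes.yaml"))
# parser = BasicParser(load_precompiled_rules())
```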

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to
be pretty mid.

Instead, the new API relies on similar but better-typed matcher
classes, with a slightly different API: they return `None` on a match
failure instead of a triplet, which makes them compose better in
iteration (e.g. failures can just be `filter`ed out).
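
A tiny sketch of why `None`-on-failure composes well (illustrative, not the actual matcher classes):

```python
def first_match(matchers, ua):
    # matchers: iterable of callables returning a parsed result, or None on failure
    return next(filter(None, (m(ua) for m in matchers)), None)
```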

Add a `Matchers` alias (a tuple of lists of matchers) to carry them
around for convenience, as well as a base parser parameter.

Also clarify the replacer rules, and hopefully implement the thing
more clearly.

Fixes #93, fixes #142, closes #116
@masklinn masklinn force-pushed the benchmarks branch 2 times, most recently from c0dc5d8 to 905c2ea on February 11, 2024 16:42
@masklinn masklinn marked this pull request as ready for review February 11, 2024 16:42
useragents.txt sample file kindly provided by @DailyMats out of
DailyMotion's data (2023-04-26).

The provided scripts allow:

- Testing the cache hit rate of various cache configurations
  (algorithm and size) on sample files; this script uses a dummy
  parser and is thus extremely fast.
- Benchmarking the average per-entry processing time of various
  parser configurations (base parser + cache algorithm + cache size)
  on sample files; this is a much slower script but provides a
  realistic evaluation, and allows using custom rules (`regexes.yaml`
  files) to check their impact on the performance of a given base
  parser (a rough sketch of this measurement follows the list).
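
A rough sketch of the per-entry timing measurement (assuming a `parser` callable and a plain-text sample file; this is not the committed script):

```python
import time

def bench(parser, sample_path: str) -> float:
    """Average seconds per user-agent entry for one parser configuration."""
    with open(sample_path, encoding="utf-8") as f:
        entries = [line.strip() for line in f if line.strip()]
    start = time.perf_counter()
    for ua in entries:
        parser(ua)
    return (time.perf_counter() - start) / len(entries)

# e.g. compare configurations on the provided sample:
# print(bench(CachingParser(base_parser, maxsize=1000), "useragents.txt"))
```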

Also added a script for testing threaded parsing; as expected this
yields zero gain over single-threaded parsing because of the GIL (and
re2 seemingly doesn't release the GIL either, though I don't know how
beneficial that would be at ~30us per call).

This may be more useful with 3.13, or possibly with a regex-based
extension releasing the GIL; at least the basis for testing things
out will be there.
@masklinn masklinn merged commit 9960dbd into ua-parser:master Feb 11, 2024
29 checks passed
@masklinn masklinn deleted the benchmarks branch February 11, 2024 19:36
@microgorgage

microgorgage commented Feb 11, 2024 via email
