Typed API & parsers API

New API with full typing
========================

Seems pretty self-explanatory: rather than returning somewhat ad-hoc dicts, this API works off of dataclasses. It should be compatible with the legacy version through the magic of ~~buying two of them~~ `dataclasses.asdict`.

Parser API
==========

The legacy version had "parsers" which really represent individual parsing rules. In the new API the job of a parser is what the top-level functions did: a parser wraps the entire job of parsing a user-agent string.

The core API is just `__call__`, with a selection flag for the domains (seems like the least bad term for what "user agent", "os", and "device" are; other alternatives I considered are "component" and "category", but I'm still ambivalent). Overridable helpers are provided which match the old API's methods (with PEP8 conventions), as well as the same style of helpers at the package toplevel.

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to test that), the ability to instantiate parsers should provide the opportunity for things like thread-local parsers, or actual parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently, including e.g. multiple independent custom yaml sets. Not sure there's a use for it, but why not? At the very least it should make using custom YAML datasets much easier than having to set envvars.

The caching parser being stateful, it's protected by an optional lock, which seems like the best way to make caching thread-safe. When only using a single thread, or using thread-local parsers, locking can be disabled by using a `contextlib.nullcontext` as lock.

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to set the global parser. Hopefully this makes evaluating proposed parsers as well as evaluating & tuning caches (algorithm & size) easier. Even more so as we should provide some sort of evaluation CLI in ua-parser#163.

Caches
------

In the old API, the package-provided cache could only be global and with a single implementation, as it had to integrate with the toplevel parsing functions. By reifying the parsing job, a cache is just a parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative cache strategies.

Bulk APIs
---------

The current parser checks rules (regexes) one at a time on the input, but there are advanced regex APIs which can check a regex *set* and return which one(s) matched, allowing much more efficient bulk matching, e.g. google's re2, rust's regex.

With the old scheme, this would be a pretty significant change in use / behaviour, obviating the use of the "parsers" with no recourse. Under the new parsing scheme, these can just be different "base" parsers: they can be the default, they can be cached, and users can instantiate their own parser instead.

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though that requires excluding that bit from the tests as there are apparently broken test cases around that item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is the opportunity to allow setting parsers at runtime more easily (instead of load-time envvars), however optional constructors (classmethods) turn out to be iffy from both an API and a typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of the UAs) just take a uniform parsed data set, and have utility loaders provide that from various data sources (precompiled, preformatted, or data files). This avoids redundancy and the need for mixins / inheritance, and mypy is *much* happier.

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to be pretty mid.

Instead, the new API relies on similar but better typed matcher classes, with a slightly different API: they return `None` on a match failure instead of a triplet, which makes them compose better in iteration (e.g. can just `filter` them out).

Add a `Matchers` alias to carry them around (a tuple of lists of matchers) for convenience, as well as a base parser parameter.

Also clarify the replacer rules, and hopefully implement the thing more clearly.

Fixes ua-parser#93, fixes ua-parser#142, closes ua-parser#116
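
As a rough illustration of two ideas from the message above (matchers that return `None` on failure, and a cache that is just a parser delegating on a miss, guarded by an optional lock), here is a minimal sketch. All class names and fields below are hypothetical stand-ins, not the package's actual classes, and the cache is a plain unbounded dict rather than any strategy the commit ships.

```python
import re
import threading
from contextlib import nullcontext
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class UserAgent:
    # Hypothetical, trimmed-down result type for this sketch only.
    family: str
    major: Optional[str] = None


class Matcher:
    """Hypothetical matcher: returns a result, or None when the rule fails."""

    def __init__(self, regex: str) -> None:
        self.pattern = re.compile(regex)

    def __call__(self, ua: str) -> Optional[UserAgent]:
        m = self.pattern.search(ua)
        return UserAgent(m.group(1), m.group(2)) if m else None


class BasicParser:
    """Hypothetical base parser: tries each matcher in order, first hit wins."""

    def __init__(self, matchers: Iterable[Matcher]) -> None:
        self.matchers = list(matchers)

    def __call__(self, ua: str) -> Optional[UserAgent]:
        # Failures are None, so they can simply be filtered out.
        return next(filter(None, (m(ua) for m in self.matchers)), None)


class CachingParser:
    """Hypothetical cache: a parser that delegates to another parser on a miss."""

    def __init__(self, parser, lock=None) -> None:
        self.parser = parser
        self.cache = {}
        # Default to a real lock; callers may pass nullcontext() to opt out.
        self.lock = threading.Lock() if lock is None else lock

    def __call__(self, ua: str) -> Optional[UserAgent]:
        with self.lock:
            if ua not in self.cache:
                self.cache[ua] = self.parser(ua)
            return self.cache[ua]


base = BasicParser([Matcher(r"(Firefox)/(\d+)"), Matcher(r"(Chrome)/(\d+)")])
# Single-threaded use: swap the lock for a no-op context manager.
parser = CachingParser(base, lock=nullcontext())
result = parser("Mozilla/5.0 Chrome/41.0.2272.104 Safari/537.36")
# Dataclass results convert back to legacy-style dicts with asdict.
legacy = asdict(result) if result else None
```

Because the cache only needs something callable with the same signature, it can wrap any base parser, which is what makes alternative cache strategies easy to swap in and compare.
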
masklinn · Feb 5, 2024 · 490c673
1 parent 6ad6f04
commit 490c673
Showing 14 changed files with 1,357 additions and 97 deletions.
diff --git a/README.rst b/README.rst
@@ -1,119 +1,127 @@
 uap-python
 ==========
 
-A python implementation of the UA Parser (https://github.com/ua-parser,
-formerly https://github.com/tobie/ua-parser)
+Official python implementation of the `User Agent String
+Parser <https://github.com/ua-parser>`_ project.
 
 Build Status
 ------------
 
 .. image:: https://github.com/ua-parser/uap-python/actions/workflows/ci.yml/badge.svg
    :alt: CI on the master branch
 
-
 Installing
 ----------
 
-Install via pip
-~~~~~~~~~~~~~~~
-
-Just run:
+Just add ``ua-parser`` to your project's dependencies, or run
 
 .. code-block:: sh
 
     $ pip install ua-parser
 
-Manual install
-~~~~~~~~~~~~~~
-
-In the top-level directory run:
-
-.. code-block:: sh
-
-    $ python setup.py install
-
-Change Log
----------------
-Because this repo is mostly a python wrapper for the User Agent String Parser repo (https://github.com/ua-parser/uap-core), the changes made to this repo are best described by the update diffs in that project. Please see the diffs for this submodule (https://github.com/ua-parser/uap-core/releases) for a list of what has changed between versions of this package.
+to install in the current environment.
 
 Getting Started
 ---------------
 
-Retrieve data on a user-agent string
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Retrieve all data on a user-agent string
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
-    >>> from ua_parser import user_agent_parser
-    >>> import pprint
-    >>> pp = pprint.PrettyPrinter(indent=4)
+    >>> from ua_parser import parse
     >>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
-    >>> parsed_string = user_agent_parser.Parse(ua_string)
-    >>> pp.pprint(parsed_string)
-    {   'device': {'brand': 'Apple', 'family': 'Mac', 'model': 'Mac'},
-        'os': {   'family': 'Mac OS X',
-                  'major': '10',
-                  'minor': '9',
-                  'patch': '4',
-                  'patch_minor': None},
-        'string': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) '
-                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 '
-                  'Safari/537.36',
-        'user_agent': {   'family': 'Chrome',
-                          'major': '41',
-                          'minor': '0',
-                          'patch': '2272'}}
-
-Extract browser data from user-agent string
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    >>> parse(ua_string) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    ParseResult(user_agent=UserAgent(family='Chrome',
+                                     major='41',
+                                     minor='0',
+                                     patch='2272',
+                                     patch_minor='104'),
+                os=OS(family='Mac OS X',
+                      major='10',
+                      minor='9',
+                      patch='4',
+                      patch_minor=None),
+                device=Device(family='Mac',
+                              brand='Apple',
+                              model='Mac'),
+                string='Mozilla/5.0 (Macintosh; Intel Mac OS...
+
+Any datum not found in the user agent string is set to ``None``::
+
+    >>> parse("")
+    ParseResult(user_agent=None, os=None, device=None, string='')
+
+Extract only browser data from user-agent string
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
-    >>> from ua_parser import user_agent_parser
-    >>> import pprint
-    >>> pp = pprint.PrettyPrinter(indent=4)
+    >>> from ua_parser import parse_user_agent
     >>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
-    >>> parsed_string = user_agent_parser.ParseUserAgent(ua_string)
-    >>> pp.pprint(parsed_string)
-    {'family': 'Chrome', 'major': '41', 'minor': '0', 'patch': '2272'}
+    >>> parse_user_agent(ua_string)
+    UserAgent(family='Chrome', major='41', minor='0', patch='2272', patch_minor='104')
 
-..
+For specific domains, a match failure just returns ``None``::
 
-    ⚠️Before 0.15, the convenience parsers (``ParseUserAgent``,
-    ``ParseOs``, and ``ParseDevice``) were not cached, which could
-    result in degraded performances when parsing large amounts of
-    identical user-agents (which might occur for real-world datasets).
-
-    For these versions (up to 0.10 included), prefer using ``Parse``
-    and extracting the sub-component you need from the resulting
-    dictionary.
+    >>> parse_user_agent("")
 
 Extract OS information from user-agent string
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
-    >>> from ua_parser import user_agent_parser
-    >>> import pprint
-    >>> pp = pprint.PrettyPrinter(indent=4)
+    >>> from ua_parser import parse_os
     >>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
-    >>> parsed_string = user_agent_parser.ParseOS(ua_string)
-    >>> pp.pprint(parsed_string)
-    {   'family': 'Mac OS X',
-        'major': '10',
-        'minor': '9',
-        'patch': '4',
-        'patch_minor': None}
-
-Extract Device information from user-agent string
+    >>> parse_os(ua_string)
+    OS(family='Mac OS X', major='10', minor='9', patch='4', patch_minor=None)
+
+Extract device information from user-agent string
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
-    >>> from ua_parser import user_agent_parser
-    >>> import pprint
-    >>> pp = pprint.PrettyPrinter(indent=4)
+    >>> from ua_parser import parse_device
     >>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
-    >>> parsed_string = user_agent_parser.ParseDevice(ua_string)
-    >>> pp.pprint(parsed_string)
-    {'brand': 'Apple', 'family': 'Mac', 'model': 'Mac'}
+    >>> parse_device(ua_string)
+    Device(family='Mac', brand='Apple', model='Mac')
+
+Parser
+~~~~~~
+
+Parsers expose the same functions (``parse``, ``parse_user_agent``,
+``parse_os``, and ``parse_device``) as the top-level of the package,
+however these are all *utility* methods.
+
+The actual protocol of parsers, and the one method which must be
+implemented / overridden is::
+
+    def __call__(self, str, Components, /) -> ParseResult:
+
+It's similar to but more flexible than ``parse``:
+
+- The ``str`` is the user agent string.
+- The ``Components`` is a hint, through which the caller requests the
+  domain (component) they are looking for, any combination of
+  ``Components.USER_AGENT``, ``Components.OS``, and
+  ``Components.DEVICE``. ``Components.ALL`` exists as a convenience alias
+  for the combination of all three.
+
+  The parser *must* return at least the requested information, but if
+  that's more convenient or no more expensive it *can* return more.
+- The ``ParseResult`` is similar to ``CompleteParseResult``, except
+  all the attributes are ``Optional`` and it has a ``components:
+  Components`` attribute which specifies whether a component was never
+  requested (its value for the user agent string is unknown) or it has
+  been requested but could not be resolved (no match was found for the
+  user agent).
+
+  ``ParseResult.complete()`` converts to a ``CompleteParseResult`` if
+  all the components are set, and raises an exception otherwise. If
+  some of the components are set to ``None``, they'll be swapped for a
+  default value.
+
+Calling the parser directly is part of the public API. One of the
+advantages is that it does not return default values; as such it allows
+more easily differentiating between a non-match (= ``None``) and a
+default fallback (``family = "Other"``).
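
The `__call__` protocol described in the README's Parser section could be sketched roughly as follows. This is only an illustration under assumptions: the flag layout (including `Components.ALL`), the `SketchResult` type, the `ToyParser` class, and the `describe` helper are all hypothetical and use plain strings instead of the real result dataclasses.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Components(Flag):
    # Assumed layout: one bit per domain, plus ALL as a convenience alias.
    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()
    ALL = USER_AGENT | OS | DEVICE


@dataclass(frozen=True)
class SketchResult:
    # Hypothetical stand-in for the README's ParseResult.
    components: Components  # what the caller asked for
    user_agent: Optional[str]
    os: Optional[str]
    string: str


class ToyParser:
    """Hypothetical parser implementing only the __call__ protocol."""

    def __call__(self, ua: str, components: Components, /) -> SketchResult:
        user_agent = os = None
        # Only resolve the domains that were actually requested.
        if Components.USER_AGENT in components and "Chrome" in ua:
            user_agent = "Chrome"
        if Components.OS in components and "Mac OS X" in ua:
            os = "Mac OS X"
        return SketchResult(components, user_agent, os, ua)


def describe(result: SketchResult, wanted: Components) -> str:
    """Tell apart 'never requested' from 'requested but unresolved'."""
    if wanted not in result.components:
        return "not requested"
    value = getattr(result, wanted.name.lower())
    return value if value is not None else "no match"


parser = ToyParser()
res = parser("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) Chrome/41.0", Components.OS)
print(describe(res, Components.OS))          # Mac OS X
print(describe(res, Components.USER_AGENT))  # not requested
print(describe(parser("", Components.ALL), Components.OS))  # no match
```

Carrying the requested flag on the result is what lets a caller distinguish an unknown value from a genuine non-match without relying on default fallbacks.
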
diff --git a/pyproject.toml b/pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "ua-parser"
 description = "Python port of Browserscope's user agent parser"
-version = "1.0.0a"
+version = "1.0.0a1"
 readme = "README.rst"
 requires-python = ">=3.8"
 dependencies = []

diff --git a/setup.py b/setup.py
@@ -1,5 +1,6 @@
 #!/usr/bin/env python
 # flake8: noqa
+import io
 from contextlib import suppress
 from os import fspath
 from pathlib import Path
@@ -51,6 +52,13 @@ def run(self) -> None:
                 f"Unable to find regexes.yaml, should be at {yaml_src!r}"
             )
 
+        def write_matcher(f, typ: str, fields: List[Optional[object]]):
+            f.write(f"        {typ}(".encode())
+            while len(fields) > 1 and fields[-1] is None:
+                fields = fields[:-1]
+            f.write(", ".join(map(repr, fields)).encode())
+            f.write(b"),\n")
+
         def write_params(fields):
             # strip trailing None values
             while len(fields) > 1 and fields[-1] is None:
@@ -70,10 +78,20 @@ def write_params(fields):
         outdir = dist_dir / self.pkg_name
         outdir.mkdir(parents=True, exist_ok=True)
 
-        dest = outdir / "_regexes.py"
+        dest = outdir / "_matchers.py"
+        dest_legacy = outdir / "_regexes.py"
 
-        with dest.open("wb") as fp:
+        with dest.open("wb") as f, dest_legacy.open("wb") as fp:
             # fmt: off
+            f.write(b"""\
+########################################################
+# NOTICE: this file is autogenerated from regexes.yaml #
+########################################################
+
+from .core import Matchers, UserAgentMatcher, OSMatcher, DeviceMatcher
+
+MATCHERS: Matchers = ([
+""")
             fp.write(b"# -*- coding: utf-8 -*-\n")
             fp.write(b"########################################################\n")
             fp.write(b"# NOTICE: This file is autogenerated from regexes.yaml #\n")
@@ -87,31 +105,35 @@ def write_params(fields):
             fp.write(b"\n")
             fp.write(b"USER_AGENT_PARSERS = [\n")
             for device_parser in regexes["user_agent_parsers"]:
-                fp.write(b"    UserAgentParser(\n")
-                write_params([
+                write_matcher(f, "UserAgentMatcher", [
                     device_parser["regex"],
                     device_parser.get("family_replacement"),
                     device_parser.get("v1_replacement"),
                     device_parser.get("v2_replacement"),
                 ])
-                fp.write(b"    ),\n")
-            fp.write(b"]\n")
-            fp.write(b"\n")
-            fp.write(b"DEVICE_PARSERS = [\n")
-            for device_parser in regexes["device_parsers"]:
-                fp.write(b"    DeviceParser(\n")
+
+                fp.write(b"    UserAgentParser(\n")
                 write_params([
                     device_parser["regex"],
-                    device_parser.get("regex_flag"),
-                    device_parser.get("device_replacement"),
-                    device_parser.get("brand_replacement"),
-                    device_parser.get("model_replacement"),
+                    device_parser.get("family_replacement"),
+                    device_parser.get("v1_replacement"),
+                    device_parser.get("v2_replacement"),
                 ])
                 fp.write(b"    ),\n")
-            fp.write(b"]\n")
-            fp.write(b"\n")
+            f.write(b"    ], [\n")
+            fp.write(b"]\n\n")
+
             fp.write(b"OS_PARSERS = [\n")
             for device_parser in regexes["os_parsers"]:
+                write_matcher(f, "OSMatcher", [
+                    device_parser["regex"],
+                    device_parser.get("os_replacement"),
+                    device_parser.get("os_v1_replacement"),
+                    device_parser.get("os_v2_replacement"),
+                    device_parser.get("os_v3_replacement"),
+                    device_parser.get("os_v4_replacement"),
+                ])
+
                 fp.write(b"    OSParser(\n")
                 write_params([
                     device_parser["regex"],
@@ -122,6 +144,29 @@ def write_params(fields):
                     device_parser.get("os_v4_replacement"),
                 ])
                 fp.write(b"    ),\n")
+            f.write(b"    ], [\n")
+            fp.write(b"]\n\n")
+
+            fp.write(b"DEVICE_PARSERS = [\n")
+            for device_parser in regexes["device_parsers"]:
+                write_matcher(f, "DeviceMatcher", [
+                    device_parser["regex"],
+                    device_parser.get("regex_flag"),
+                    device_parser.get("device_replacement"),
+                    device_parser.get("brand_replacement"),
+                    device_parser.get("model_replacement"),
+                ])
+
+                fp.write(b"    DeviceParser(\n")
+                write_params([
+                    device_parser["regex"],
+                    device_parser.get("regex_flag"),
+                    device_parser.get("device_replacement"),
+                    device_parser.get("brand_replacement"),
+                    device_parser.get("model_replacement"),
+                ])
+                fp.write(b"    ),\n")
+            f.write(b"])\n")
             fp.write(b"]\n")
             # fmt: on
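
To see what the `write_matcher` helper from the `setup.py` diff emits into `_matchers.py`, the sketch below copies its logic and runs it against an in-memory buffer. The two rule entries are made-up examples for illustration, not entries taken from `regexes.yaml`.

```python
import io
from typing import List, Optional


def write_matcher(f, typ: str, fields: List[Optional[object]]) -> None:
    # Same logic as the helper in the diff: emit one constructor call,
    # dropping trailing None arguments so defaults apply.
    f.write(f"        {typ}(".encode())
    while len(fields) > 1 and fields[-1] is None:
        fields = fields[:-1]
    f.write(", ".join(map(repr, fields)).encode())
    f.write(b"),\n")


buf = io.BytesIO()
# Hypothetical rules: only trailing Nones are stripped, inner ones are kept.
write_matcher(buf, "UserAgentMatcher", ["(Firefox)/([0-9]+)", None, None, None])
write_matcher(buf, "DeviceMatcher", ["iPhone", None, "iPhone", "Apple", None])
generated = buf.getvalue().decode()
print(generated)
#         UserAgentMatcher('(Firefox)/([0-9]+)'),
#         DeviceMatcher('iPhone', None, 'iPhone', 'Apple'),
```

Note that a `None` in the middle of the argument list (here the device rule's `regex_flag`) has to stay, since the generated calls are positional.
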