refactor(perf): Lazy rewrite of `schemapi.py`, improve docs #3547

dangotbanned · 2024-08-19T22:22:50Z

Description

I went down a few too many rabbit holes in this PR.

This comment has an overview of most of the changes.

Misc

Removed this in 8002dab (#3547) but if anyone was curious:

altair.tests.utils.test_schemapi.py

...

_SKIP_SLOW_BENCHMARKS: bool = True
_REPEAT_TIMES = 1000


@pytest.mark.parametrize("to_or_from", ["to_dict-validate", "to_dict", "from_dict"])
@pytest.mark.filterwarnings("ignore:.*:UserWarning")
@pytest.mark.skipif(
    _SKIP_SLOW_BENCHMARKS,
    reason="Should only be run in isolation to test single threaded performance.",
)
def test_chart_validation_benchmark(
    to_or_from: Literal["to_dict-validate", "to_dict", "from_dict"],
) -> None:
    """
    Intended to isolate `Chart.(to|from)_dict.` calls.

    Repeated ``_REPEAT_TIMES`` times, non-parametric:
    - in an attempt to limit the potential overhead of ``pytest``
    - but enforce ``1`` thread, like a user-code would be.

    Results
    -------
    ```
    _REPEAT_TIMES = 1000
    pytest -k test_chart_validation_benchmark  --numprocesses=3 --durations=3 tests

    # Pre-`SchemaBase.from_dict` refactor (3.12.3)
    108.16s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    84.62s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.71s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`SchemaBase.from_dict` refactor (3.12.3)
    107.84s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    50.43s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    67.07s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`SchemaBase.__init_subclass__` addition (3.12.3)
    108.24s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    50.33s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.51s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]

    # Post-`dict` branch micro optimization in `_FromDict.from_dict` (3.12.3)
    107.90s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict-validate]
    49.63s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[from_dict]
    66.87s call     tests/utils/test_schemapi.py::test_chart_validation_benchmark[to_dict]
    ```
    """
    from itertools import chain, repeat

    if TYPE_CHECKING:
        from altair.typing import ChartType

    def _iter_charts() -> Iterator[ChartType]:
        """
        Ensures only len(chart_funcs_error_message) actual charts are constructed.

        The `to_dict` calls are what gets multiplied
        """
        charts: list[ChartType] = [fn() for fn, _ in chart_funcs_error_message]
        yield from chain.from_iterable(repeat(charts, times=_REPEAT_TIMES))

    def _iter_chart_factory() -> Iterator[ChartType]:
        """
        Validation not the bottleneck, but encode is.

        Ensures at least `times` * len(chart_funcs_error_message) .encode calls are made.
        """
        chart_funcs: list[Callable[[], ChartType]] = [
            fn for fn, _ in chart_funcs_error_message
        ]
        for fn in chain.from_iterable(repeat(chart_funcs, times=_REPEAT_TIMES)):
            yield fn()

    def _to_dict(validate: bool) -> None:
        if validate:
            for chart in _iter_charts():
                with pytest.raises(schemapi.SchemaValidationError):
                    chart.to_dict(validate=validate)
        else:
            for chart in _iter_charts():
                chart.to_dict(validate=validate)

    if to_or_from == "to_dict":
        _to_dict(validate=False)
    elif to_or_from == "to_dict-validate":
        _to_dict(validate=True)
    else:
        assert list(_iter_chart_factory())

Every call was identical as it was based on an existing constant `jsonschema_version_str`

I'm going to do this a lot. Docstrings can be collapsed in all editors and can benefit from markdown. Everything here is already private, so using long comments has no benefit

`typeshed` disagrees with `jsonschema`, this is just enforcing what `jsonschema` says is true

Produces the same result, but skips the upfront `deepcopy`. No longer modifying the copy inplace, new objects are created inside the iterator.

https://github.com/python/typeshed/blob/937270df0c25dc56a02f7199f1943fdb7d47aa9d/stubs/jsonschema/jsonschema/protocols.pyi#L11

Previously, using a version below `4.0.1` would still always check first if there was a property. This would not change between checks. Defining in this style removes the need for as much documentation, since the version guards are very clear when each branch is used.

First non-failing version. Have left most of the original code in. Planning to migrate & adapt the comments before removing. #

tools/schemapi/schemapi.py

vega#3547 (comment)

**Remove before review**. Using for quicker feedback loop, where running mypy and all tests are not beneficial

vega#3547 (comment)

dangotbanned · 2024-08-20T13:17:41Z

Pulled out alternative error message idea from #3547 (comment) out for later discussion.

The issue this was attached to has been resolved, but I still think this is an area the SchemaValidationError message could be refined.

Note

Originally added to use like below.
However this was naive, due to this skipping disjoint enums.

from functools import reduce

def _lazy_enum(iterable: Iterable[ValidationError], /) -> Iterator[ValidationError]:
    yield reduce(_enum_inner, iterable)

Possible alternative

Rather than preserving each enum, we could just extract the union of them.

It would produce a different error message than the original, so for the below:

altair/tests/utils/test_schemapi.py

Lines 779 to 788 in 591bf40

    
                   chart_error_example__invalid_timeunit_value, 
        
                   r"""'invalid_value' is an invalid value for `timeUnit`. Valid values are: 
        
                           - One of \['year', 'quarter', 'month', 'week', 'day', 'dayofyear', 'date', 'hours', 'minutes', 'seconds', 'milliseconds'\] 
        
                           - One of \['utcyear', 'utcquarter', 'utcmonth', 'utcweek', 'utcday', 'utcdayofyear', 'utcdate', 'utchours', 'utcminutes', 'utcseconds', 'utcmilliseconds'\] 
        
                           - One of \['yearquarter', 'yearquartermonth', 'yearmonth', 'yearmonthdate', 'yearmonthdatehours', 'yearmonthdatehoursminutes', 'yearmonthdatehoursminutesseconds', 'yearweek', 'yearweekday', 'yearweekdayhours', 'yearweekdayhoursminutes', 'yearweekdayhoursminutesseconds', 'yeardayofyear', 'quartermonth', 'monthdate', 'monthdatehours', 'monthdatehoursminutes', 'monthdatehoursminutesseconds', 'weekday', 'weekdayhours', 'weekdayhoursminutes', 'weekdayhoursminutesseconds', 'dayhours', 'dayhoursminutes', 'dayhoursminutesseconds', 'hoursminutes', 'hoursminutesseconds', 'minutesseconds', 'secondsmilliseconds'\] 
        
                           - One of \['utcyearquarter', 'utcyearquartermonth', 'utcyearmonth', 'utcyearmonthdate', 'utcyearmonthdatehours', 'utcyearmonthdatehoursminutes', 'utcyearmonthdatehoursminutesseconds', 'utcyearweek', 'utcyearweekday', 'utcyearweekdayhours', 'utcyearweekdayhoursminutes', 'utcyearweekdayhoursminutesseconds', 'utcyeardayofyear', 'utcquartermonth', 'utcmonthdate', 'utcmonthdatehours', 'utcmonthdatehoursminutes', 'utcmonthdatehoursminutesseconds', 'utcweekday', 'utcweekdayhours', 'utcweekdayhoursminutes', 'utcweekdayhoursminutesseconds', 'utcdayhours', 'utcdayhoursminutes', 'utcdayhoursminutesseconds', 'utchoursminutes', 'utchoursminutesseconds', 'utcminutesseconds', 'utcsecondsmilliseconds'\] 
        
                           - One of \['binnedyear', 'binnedyearquarter', 'binnedyearquartermonth', 'binnedyearmonth', 'binnedyearmonthdate', 'binnedyearmonthdatehours', 'binnedyearmonthdatehoursminutes', 'binnedyearmonthdatehoursminutesseconds', 'binnedyearweek', 'binnedyearweekday', 'binnedyearweekdayhours', 'binnedyearweekdayhoursminutes', 'binnedyearweekdayhoursminutesseconds', 'binnedyeardayofyear'\] 
        
                           - One of \['binnedutcyear', 'binnedutcyearquarter', 'binnedutcyearquartermonth', 'binnedutcyearmonth', 'binnedutcyearmonthdate', 'binnedutcyearmonthdatehours', 'binnedutcyearmonthdatehoursminutes', 'binnedutcyearmonthdatehoursminutesseconds', 'binnedutcyearweek', 'binnedutcyearweekday', 'binnedutcyearweekdayhours', 'binnedutcyearweekdayhoursminutes', 'binnedutcyearweekdayhoursminutesseconds', 'binnedutcyeardayofyear'\] 
        
                           - Of type 'object'$""",

We could adapt the formatting from SchemaValidationError._format_params_as_table to print something like this:

Large Screenshot

Which would just read as:

'invalid_value' is an invalid value for timeUnit. Valid values are:

One of [...]

Of type 'object'

This would also be easier to control the formatting than relying on the groups provided by the enums alone

Note to self: Fill out comments re individual changes.

Temporary, will remove before review. Tried to isolate to a single function so that I can reproduce on main

I renamed this from `_SLOW_BENCHMARKS` but forgot to invert the bool lol

Wasn't able to demonstrate a performance improvement

dangotbanned · 2024-09-03T10:15:24Z

altair/vegalite/v5/schema/core.py

 class VegaLiteSchema(SchemaBase):
+    _schema = load_schema()
    _rootschema = load_schema()


Linking this back to:

refactor: Ensure every VegaLiteSchema has a ._schema

Has the same result for Root, since it inherits _schema.

The main benefit is a stronger guarantee that every derived class is non-None for both attributes

dangotbanned · 2024-09-03T12:03:42Z

Edit

See narwhals-dev/narwhals#905

Bug report unrelated to PR

@MarcoGorelli do you have any ideas on what is causing this error for ibis on python==3.8?

https://github.com/vega/altair/actions/runs/10681007191/job/29608118357?pr=3547

I did a merge update from the GitHub UI, but haven't been able to track down the cause yet.

Updated

Seems to be this commit which causes the issue (only for 3.8) narwhals-dev/narwhals@3d246b7

..\..\..\AppData\Local\hatch\env\virtual\altair\CXM7NV9I\hatch-test.py3.8\lib\site-packages\narwhals\dependencies.py:128: in is_ibis_table
    return bool((ibis := get_ibis()) is not None and isinstance(df, ibis.Table))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  

name = 'Table'

    def __getattr__(name: str) -> BaseBackend:
        """Load backends in a lazy way with `ibis.<backend-name>`.

        This also registers the backend options.

        Examples
        --------
        >>> import ibis
        >>> con = ibis.sqlite.connect(...)

        When accessing the `sqlite` attribute of the `ibis` module, this function
        is called, and a backend with the `sqlite` name is tried to load from
        the `ibis.backends` entrypoints. If successful, the `ibis.sqlite`
        attribute is "cached", so this function is only called the first time.
        """
        entry_points = {ep for ep in util.backend_entry_points() if ep.name == name}

        if not entry_points:
            msg = f"module 'ibis' has no attribute '{name}'. "
            if name in _KNOWN_BACKENDS:
                msg += f"""If you are trying to access the '{name}' backend,
                        try installing it first with `pip install ibis-{name}`"""
>           raise AttributeError(msg)
E           AttributeError: module 'ibis' has no attribute 'Table'.

..\..\..\AppData\Local\hatch\env\virtual\altair\CXM7NV9I\hatch-test.py3.8\lib\site-packages\ibis\__init__.py:58: AttributeError

Used in `Chart.from_dict`

dangotbanned · 2024-09-03T14:28:07Z

tools/schemapi/schemapi.py

+__all__ = [
+    "Optional",  # altair.utils
+    "SchemaBase",  # altair.vegalite.v5.schema.core
+    "Undefined",  # altair.utils
+    "UndefinedType",  # altair.vegalite.v5.schema.core -> (side-effect relied on to propagate to alt.__init__)
+    "_is_valid",  # altair.vegalite.v5.api
+    "_resolve_references",  # tools.schemapi.utils -> tools.generate_schema_wrapper
+    "_subclasses",  # altair.vegalite.v5.schema.core
+    "is_undefined",  # altair.typing
+    "validate_jsonschema",  # altair.utils.display
+    "with_property_setters",  # altair.vegalite.v5.schema.channels
+]


Keeping track of the refs here, since you can't follow the references from the tools copy

All were notes added earlier in PR, but not needed now

WIll put this on another branch, won't improve performance so not relevant here vega#3547 (comment)

Feeling I've squeezed out all the performance I can for now. Will add in a collpased comment on the PR for reference

binste · 2024-09-04T09:11:11Z

Quite the PR!! Will take me a while to dig through it but thanks for going down all these rabbit holes, hope you came out unharmed! ;) I like jsonschema validation but it also can be a wild place, especially with the mapping of the schema to Altair objects...

Do I understand this correct that the main motivation for this PR is to improve the performance and that the main impact on performance is that from_dict is considerably faster based on your benchmark (which is super cool!), where as there is no change in to_dict and the validation?

dangotbanned · 2024-09-04T10:51:15Z

Quite the PR!! Will take me a while to dig through it but thanks for going down all these rabbit holes, hope you came out unharmed! ;) I like jsonschema validation but it also can be a wild place, especially with the mapping of the schema to Altair objects...

Thanks and yeah no rush @binste - I know there is a lot to go through - even after all the stuff I removed over the past few days.

Do I understand this correct that the main motivation for this PR is to improve the performance and that the main impact on performance is that from_dict is considerably faster based on your benchmark (which is super cool!), where as there is no change in to_dict and the validation?

So it is tricky.

I found it quite difficult to benchmark this, since there are so many interconnected parts.
I can confidently say from_dict is faster, but of the three in the benchmark that is the only one which benefitted.

We also have much better performance on 3.8-3.11 with changes like 933c045 (#3547).
All versions are pretty much equal now on the second run of hatch test --all.

I would like to - in a future PR - develop some more isolated benchmarking tools.
Ideally, these would be based on specs that have proved to be slow in real-world use, so that we can more easily get to the root cause

binste · 2024-09-22T15:12:30Z

I very much appreciate the time and effort you put into this and I believe some/many/maybe all changes here could be great improvements to Altair. However, as it's such a large and complex PR, I'm struggling to review this PR to an extent where I'd feel comfortable approving it.

Although improvements to the performance of from_dict are great, it's a feature that is not used by many users. I'm worried that the time someone would need to invest to review this PR could be better used elsewhere as well as the added complexity to the codebase which is a lot (it's hard to find developers familiar enough with jsonschema which want to work on Altair, the more complicated we make this part, the harder this gets).

Would you be willing to split this PR into multiple ones? They could be put up for review and merged one after the other with this PR (in draft mode) being the resource to cherry-pick commits from. I think this would help with reviewing as well as with judging the tradeoff between added complexity of some of the changes vs. improved performance.

Python 3.8-3.11 performance improvements: You mentioned that this mainly comes from 933c045. This could be a separate PR, easy to merge, and already a big win for users.
from_dict performance improvements: Do you have a feeling for which changes lead to this improvement? Is it everything in the current PR or can it be boiled down to fewer more isolated changes?

I hope this is ok for you and that you understand my position, i.e. that it does not feel random. We can also do more pre-alignment on rough drafts before starting to work on detailed PRs. Also something we can pick up in our upcoming call.

dangotbanned · 2024-09-22T17:07:06Z

#3547 (comment)

I hope this is ok for you and that you understand my position, i.e. that it does not feel random.

@binste
Honestly I would say the main benefit of this PR was for me to understand this part of altair a little better.
That is, the process of working on it I see as most valuable - so please don't worry I totally understand that it isn't an easy review.

Would you be willing to split this PR into multiple ones?

Yeah that sounds like a good idea.
I will need to have a think about how best to do that, but at the very least I could get to the same outcome with a cleaner history and less experimentation.

I'll loop back to this at some point and see what smaller PRs I can make out of it.

Appreciate your feedback on this, regardless of the outcome

dangotbanned added 12 commits August 18, 2024 21:57

perf: Replace _use_referencing_library() with a constant

d95f332

Every call was identical as it was based on an existing constant `jsonschema_version_str`

docs: Move _get_errors_from_spec comment into docstring

f4a4e0e

I'm going to do this a lot. Docstrings can be collapsed in all editors and can benefit from markdown. Everything here is already private, so using long comments has no benefit

fix(typing): Resolve jsonschema incomplete stubs issue

6dfe61b

`typeshed` disagrees with `jsonschema`, this is just enforcing what `jsonschema` says is true

refactor: Reuse None in ternary expression

e9a4beb

perf: Replace _prepare_references_in_schema

c534876

Produces the same result, but skips the upfront `deepcopy`. No longer modifying the copy inplace, new objects are created inside the iterator.

chore: Add note on _get_errors_from_spec

e30cac6

build: run generate-schema-wrapper

e702b26

refactor(typing): Use stubs type _JsonParameter

af878d0

https://github.com/python/typeshed/blob/937270df0c25dc56a02f7199f1943fdb7d47aa9d/stubs/jsonschema/jsonschema/protocols.pyi#L11

refactor: Shorten some references to ValidationError

5ca44df

build: run generate-schema-wrapper

05976a6

perf(DRAFT): Initial lazy validation

af6e41c

First non-failing version. Have left most of the original code in. Planning to migrate & adapt the comments before removing. #

dangotbanned added enhancement maintenance labels Aug 19, 2024

Merge remote-tracking branch 'upstream/main' into perf-validation

b24ac45

dangotbanned commented Aug 19, 2024

View reviewed changes