
feat: don't store duplicate events in the notice queue #1372

Open · wants to merge 6 commits into base: main

Conversation

@tonyandrewmeyer (Contributor) commented Sep 13, 2024

When a new event is emitted, check if there is already an exact duplicate (the same observer, same event type, and same event snapshot) in the storage from a deferral in a previous run. If a duplicate does exist, then don't store a new notice and snapshot.
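The duplicate check described above can be sketched in a simplified, self-contained form. Note that `FakeStorage`, `is_duplicate`, and the path format below are illustrative stand-ins based on this description, not the actual `ops` implementation:

```python
class FakeStorage:
    """Minimal in-memory stand-in for the framework's notice/snapshot storage."""

    def __init__(self):
        self._notices = []    # (event_path, observer_path, method_name)
        self._snapshots = {}  # event_path -> snapshot data

    def save_notice(self, event_path, observer_path, method_name, snapshot):
        self._notices.append((event_path, observer_path, method_name))
        self._snapshots[event_path] = snapshot

    def notices(self):
        return list(self._notices)

    def load_snapshot(self, event_path):
        return self._snapshots.get(event_path, {})


def is_duplicate(storage, event_path, observer_path, method_name, snapshot):
    """True if an identical (observer, event kind, snapshot) notice is queued."""
    event_kind = event_path.split('[')[0]  # strip the per-emit sequence key
    for existing_path, existing_observer, existing_method in storage.notices():
        if (existing_observer != observer_path
                or existing_method != method_name
                or existing_path.split('[')[0] != event_kind):
            continue
        # Same observer, method, and event kind: the snapshot decides.
        if storage.load_snapshot(existing_path) == snapshot:
            return True
    return False


store = FakeStorage()
store.save_notice('MyCharm/on/config_changed[1]', 'MyCharm[0]', '_on_config_changed', {})
# Same observer, kind, and (empty) snapshot: a duplicate, so it would be skipped.
assert is_duplicate(store, 'MyCharm/on/config_changed[2]', 'MyCharm[0]', '_on_config_changed', {})
# Different snapshot: not a duplicate, so a new notice would be stored.
assert not is_duplicate(store, 'MyCharm/on/config_changed[3]', 'MyCharm[0]', '_on_config_changed', {'id': 'x'})
```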

There are performance implications:

  • For every event emitted, there's a full iteration through the notices queue. If there are no deferred events, then the notices queue will be empty, so in the majority of cases this should only be one additional storage query (to get the empty result). If the deferral queue is very large, then this may be noticeable, although the point of the change is to reduce the length of the queue, because there are already issues when it's large.
  • For each notice in the queue (again: normally none), the snapshot is loaded (a storage query; currently all snapshots are quite small) for any events that have matching observers and kinds. If the queue holds many events of the same type, it's likely the observer would match, and if the snapshots were different (for example, many secret-changed events but for different secrets) then there is a cost to doing the comparison.
  • In cases where the queue currently builds up with a lot of duplicates, there will be a significant performance improvement, because only one of the (notice+snapshot) events will be processed each run.

There is also a (deliberate) behaviour change that does impact event ordering. For example, consider this sequence of events:

  1. config-changed deferred
  2. secret-changed deferred
  3. config-changed

Currently, this would result in:

  1. ConfigChangedEvent
  2. ConfigChangedEvent, SecretChangedEvent
  3. ConfigChangedEvent, SecretChangedEvent, ConfigChangedEvent

With this change, this would result in:

  1. ConfigChangedEvent
  2. ConfigChangedEvent, SecretChangedEvent
  3. ConfigChangedEvent, SecretChangedEvent

More generally, there could currently be duplicate (notice+snapshot) events intermixed throughout the queue, and each run they will be re-emitted in the order in which they originally occurred. With this change, they will be re-emitted in the order in which they originally occurred, except when they have already been emitted in the current run. The particularly noticeable change is that the Juju event that triggered the run may not be the last event emitted (if it was a duplicate of one in the queue).
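The behaviour change above can be simulated with a plain list standing in for the deferral queue (an illustration only; real notices carry more data than these (kind, snapshot) tuples):

```python
def emit_with_dedup(queue, event):
    """Store the event for reemission only if an identical one is not queued."""
    if event not in queue:
        queue.append(event)


queue = []
emit_with_dedup(queue, ('config-changed', ()))             # run 1: deferred
emit_with_dedup(queue, ('secret-changed', ('secret:a',)))  # run 2: deferred
emit_with_dedup(queue, ('config-changed', ()))             # run 3: duplicate, skipped
# The original order is preserved; the triggering event is not re-added at the end.
assert queue == [('config-changed', ()), ('secret-changed', ('secret:a',))]
```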

We could potentially do this differently - for example, when a duplicate occurs, move the event to the end of the queue (by dropping and re-adding the notice+snapshot, by explicitly setting the sequence field in the SQL backend, or by simply reordering the list in the Juju backend). However, this would add complexity and a performance penalty, and it seems more correct to keep the original order.
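For comparison, the rejected move-to-end alternative would look something like this (a hypothetical sketch, not what this PR implements):

```python
def emit_move_to_end(queue, event):
    """Alternative (not implemented): a duplicate moves to the end of the queue."""
    if event in queue:
        queue.remove(event)  # drop the older copy...
    queue.append(event)      # ...and re-add it at the tail


queue = [('config-changed', ()), ('secret-changed', ('secret:a',))]
emit_move_to_end(queue, ('config-changed', ()))
# The triggering event now sorts last, changing the replay order.
assert queue == [('secret-changed', ('secret:a',)), ('config-changed', ())]
```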

For unit tests:

  • Harness: this change is incompatible with some Harness usage. Specifically, if the test code emits the same event more than once and it's deferred at least once, there is no reemit() call like there would be in production, yet the test expects the handler to be called more than once. For this reason, the skipping is disabled for Harness.
  • Scenario: Scenario is more explicit about deferred events, so if you want the 'skipping' behaviour to have occurred before the event you are running via ctx.run, you need to manage that in the list of deferred events passed into the State. We could add a consistency check that alerts if there are duplicates in that list (this would be easier to do once the Scenario code is in this repo). However, the Scenario behaviour does still change: if the event is deferred in the ctx.run and there is already a match in the state's deferred list, then the new (deferred) event does not get added, and the output state doesn't change (which is what we want). We get this behaviour automatically because Scenario mimics the runtime behaviour more closely, actually running the framework's emitting/reemitting. So Scenario tests are both safer and can be used to match the new behaviour.

This can be tested manually with a charm that optionally defers events, e.g. with the code below.

charm.py and charmcraft.yaml content

import logging

import ops

logger = logging.getLogger(__name__)


class NoticeQueueCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)
        framework.observe(self.on.secret_changed, self._on_secret_changed)

    def _on_config_changed(self, event):
        logger.info("Running config-changed")
        if self.config.get("secretopt"):
            secret = self.model.get_secret(id=self.config["secretopt"])
            # Get the content so we are 'subscribed' to the updates.
            secret.get_content()
        if self.config.get("secretopt2"):
            secret = self.model.get_secret(id=self.config["secretopt2"])
            # Get the content so we are 'subscribed' to the updates.
            secret.get_content()
        if self.config.get("opt", "").startswith("defer"):
            event.defer()

    def _on_secret_changed(self, event):
        logger.info("Running secret-changed")
        if self.config.get("opt", "").startswith("defer"):
            event.defer()


if __name__ == "__main__":
    ops.main(NoticeQueueCharm)
config:
  options:
    opt:
      type: string
      description: dummy option to trigger config-changed
    secretopt:
      type: secret
      description: a user secret
    secretopt2:
      type: secret
      description: a user secret

If you want to see the queue while you're doing this, you can use code like this:

        store = self.framework._storage
        for event_path, observer_path, method_name in store.notices(None):
            handle = ops.Handle.from_path(event_path)
            snapshot_data = store.load_snapshot(handle.path)
            logger.info(
                "event_path: %s, observer_path: %s, method_name: %s, snapshot data: %r",
                event_path, observer_path, method_name, snapshot_data,
            )

If you set opt to anything not starting with "defer", you should get a config-changed event every time. If you set it to something starting with "defer", the handler will run exactly once each time you set the config (remember to change the value, or Juju will skip the event). With ops@main, you'd instead get a config-changed event for every config change made so far, each time you change the config (i.e. the queue builds up).

You can also check it with an event that has a snapshot, such as secret-changed. If the config is set to defer, every time you change the content of the first secret, you'll get one secret-changed event (but with ops@main, each time you'll get multiple, depending on how many times you've done it). If you also change the second secret, you'll get two secret-changed events, one for each secret (because the snapshots differ).

You can intermix the different events and should always have exactly zero or one (event type+snapshot) in the queue. If you change the opt value back to something not starting with "defer", then you should see all the events complete and have an empty queue.

Scenario test that shows the behaviour
import ops
from ops import testing

class MyCharm(ops.CharmBase):
    def __init__(self, framework):
        super().__init__(framework)
        framework.observe(self.on.secret_changed, self._on_sc)
        framework.observe(self.on.update_status, self._on_us)

    def _on_us(self, event):
        print("update-status", event)

    def _on_sc(self, event):
        print("secret-changed", event)
        event.defer()


ctx = testing.Context(MyCharm, meta={"name": "foo"})
secret = testing.Secret({"foo": "bar"})
devent1 = ctx.on.update_status().deferred(MyCharm._on_us)
devent2 = ctx.on.secret_changed(secret).deferred(MyCharm._on_sc)
state_in = testing.State(secrets={secret}, deferred=[devent1, devent2])
state_out = ctx.run(ctx.on.secret_changed(secret), state_in)

assert state_out.unit_status == testing.UnknownStatus()
print(state_out.deferred)

Note that this requires fixing a small bug in Secret snapshotting (PR). That's unrelated to this change - it also occurs if you use deferred secrets in main.

Fixes #935

@tonyandrewmeyer (Contributor, Author)

Although this seems like a fairly small change, and it's in one very specific piece of the code, it's at the heart of running every event, so very thorough review is appreciated. I'm also happy to discuss the performance implications as well as the implications of the behaviour change.

@tonyandrewmeyer (Contributor, Author)

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.

@tonyandrewmeyer tonyandrewmeyer marked this pull request as draft September 13, 2024 05:52
if (
    existing_observer_path != observer_path
    or existing_method_name != method_name
    or existing_event_path.split('[')[0] != event_path.split('[')[0]
Contributor:

Looking through class Handle, it seems that not all paths have a "key" index.

Contributor Author:

That's true, but all events (from BoundEvent's emit) do have one, and that's the only type of object that has its path in the notices table.

If the event_path somehow was missing the key, then the code would also still work, because it's getting the 0th element so it will still be checking the full path - there just won't be a [1:] that gets discarded.
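Concretely, event handle paths in the notices table end with a `[key]` sequence suffix, so splitting on `'['` compares only the event kind (the paths below are illustrative):

```python
# A normal event path from BoundEvent's emit carries a trailing '[key]':
path = 'NoticeQueueCharm/on/config_changed[5]'
assert path.split('[')[0] == 'NoticeQueueCharm/on/config_changed'

# A path that somehow lacked the key would still compare correctly,
# because split returns the whole string as element 0:
bare = 'NoticeQueueCharm/on/config_changed'
assert bare.split('[')[0] == bare
```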

Contributor:

This assumption/logic deserves a comment, or maybe a unit test.
I'd be lost if I tried to eyeball this without your comment.

@tonyandrewmeyer (Contributor, Author)

Code like this doesn't work with Harness:

import ops
from ops import testing


class MyCharm(ops.CharmBase):
    ...
    def _on_config_changed(self, event):
        if self.something():
            event.defer()
        self.do_something()


def test_config_changed():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: False
    h.charm.on.config_changed.emit()
    MyCharm.something = lambda _: True
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called  # Assume this was patched with a Mock somewhere

The framework skips the second emit. Arguably, the test should clean up in some way, because it's actually going to run the config-changed handler three times (once when it defers, then twice when it doesn't defer: once for the deferred notice and once for the new event). The test could be broken into multiple tests, use a fresh Harness, or call reemit() rather than emit() to get the deferred event rather than a new one.

It seems likely (I'll do a quick check to be sure) that this pattern is too common to have the change in this way - I don't think we can expect people to update their tests to account for this, since it's not really a bug, just not the ideal way to write the test. Maybe the 'skip duplicate defers' will need to be explicitly enabled?

@tonyandrewmeyer (Contributor, Author) commented Oct 4, 2024

I did a super-tox (-e unit) run and there were no new failures ... however, since this behaviour isn't active when using Harness, that's less informative than it would otherwise be. I guess I could try an -e integration run, although I'm not sure how much coverage that provides (particularly for deferring).

tonyandrewmeyer and others added 5 commits October 4, 2024 18:38
Docs: Ops: Add top-level intro, clean up ops.module intro in light of
that, flesh out ops.pebble and ops.testing intros a bit more

https://warthogs.atlassian.net/browse/CHARMTECH-183

- [x] "home page" library reference
- [x] ops module
- [x] ops.main entry point
- [x] ops.pebble module
- [x] ops.testing module
…names in ops.testing (canonical#1381)

Add a new optional install `testing`, e.g. `pip install ops[testing]`.
This pulls in a compatible version of `ops-scenario`, and exposes the
Scenario names in the `ops.testing` namespace, alongside Harness.

`pip install ops[harness]` is also supported to ease the (far off,
presumably 3.0) transition to Harness being moved out of the base
install. It currently installs no extra dependencies, so is the same as
`pip install ops` but a forward-looking charm would use `pip install
ops[harness]` (or `pip install ops[harness, testing]`) if using Harness.

Requires ops-scenario 7.0.5, which has the required adjustments to
support insertion into `ops.testing`.

The `ActionFailed` name exists in both `ops._private.harness` and
`scenario.context`. This is handled by adjusting the Harness class to
support the functionality of both and monkeypatching that into Scenario
until Scenario starts using it. It's compatible with both Harness and
Scenario, but will have empty data in an attribute (which attribute
depends on which framework is used).

The `Container` name in `ops.testing`, which is only present for
backwards compatibility, is also overwritten if ops-scenario is
installed. If anyone is actually using `ops.testing.Container` instead
of `ops.Container` then they'll need to fix their code before using
`ops[testing]` (or ops-scenario combined with the release of ops with
this change).

A very basic unit test is added to make sure that Scenario tests work
(all the actual Scenario tests are in the ops-scenario repo) if
ops-scenario/ops[testing] is installed (this is the case for `tox -e
unit`). A test is also added to ensure that all of the Scenario names
are documented, since `automodule` isn't used any more.

Also adjusts the documentation to include the new framework in
ops.testing.

---------

Co-authored-by: Ben Hoyt <[email protected]>
Co-authored-by: Dima Tisnek <[email protected]>
@tonyandrewmeyer (Contributor, Author)

Ok, this should be good for an initial review now.

@tonyandrewmeyer tonyandrewmeyer marked this pull request as ready for review October 4, 2024 05:50
@benhoyt (Collaborator) left a comment

Thanks, this looks good to me, and it's a nice one to get fixed. Could we do a couple more tests before merging, just in case:

  1. You've called out the performance changes, but as a sanity check, could you please do a couple of measurements? What's a typical time to load the notices given an empty queue? And then, if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but it would be good to confirm.
  2. I think trying to run super-tox -e integration will be too slow / problematic, but could we at least spot-check tox -e integration on a couple of charms that use defer?
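A rough way to get numbers of that kind, using in-memory SQLite tables shaped loosely like the notices/snapshot storage (the schema and absolute figures are assumptions for illustration, not the real ops schema, but the order of magnitude should carry over):

```python
import sqlite3
import timeit

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE notice ('
    'sequence INTEGER PRIMARY KEY, event_path TEXT, observer_path TEXT, method_name TEXT)'
)
conn.execute('CREATE TABLE snapshot (handle TEXT PRIMARY KEY, data TEXT)')

# Case 1: the common path, scanning an empty notices table.
empty_scan = timeit.timeit(
    lambda: list(conn.execute('SELECT * FROM notice')), number=1000
) / 1000

# Case 2: loading one small snapshot for a matching notice.
conn.execute("INSERT INTO snapshot VALUES ('MyCharm/on/secret_changed[1]', '{}')")
snapshot_load = timeit.timeit(
    lambda: conn.execute(
        'SELECT data FROM snapshot WHERE handle = ?',
        ('MyCharm/on/secret_changed[1]',),
    ).fetchone(),
    number=1000,
) / 1000

print(f'empty scan: {empty_scan * 1e6:.1f} us, snapshot load: {snapshot_load * 1e6:.1f} us')
```

Both operations come out in the microsecond range on an in-memory database; on-disk storage with the real schema would be slower, but the sub-millisecond guess looks plausible.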

@dimaqq (Contributor) left a comment

Sorry I may not have thought this through.

Is it possible to get 2 distinct events from Juju that only differ in arguments?

For example, 2 secrets are now available; the secret ID is different, but everything else is the same (event name, observer function, human name if any).
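Based on the PR description, two events that differ only in their snapshot content (such as secret-changed events for two different secret IDs) would not be treated as duplicates, because the snapshot is part of the comparison. A toy model of that distinction (illustrative only, not the real notice format):

```python
queue = []  # (event kind, snapshot) pairs standing in for stored notices


def maybe_enqueue(kind, snapshot):
    """Skip storage only when kind AND snapshot both match a queued entry."""
    if (kind, snapshot) not in queue:
        queue.append((kind, snapshot))


maybe_enqueue('secret-changed', ('secret:aaa',))
maybe_enqueue('secret-changed', ('secret:bbb',))  # different secret ID: kept
maybe_enqueue('secret-changed', ('secret:aaa',))  # exact duplicate: skipped
assert len(queue) == 2
```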

Comment on lines +860 to +861
except NoSnapshotError:
existing_event_data = {}
Contributor:

This code branch feels wrong.
If an event cannot be loaded, should it not be discarded instead?
Or do we have saved snapshots with no data?
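One possible reading of that branch: events whose snapshot is empty never save a snapshot row at all, so loading can raise even for a perfectly valid queued notice. A standalone illustration of that pattern, with hypothetical names rather than the real ops storage API:

```python
class NoSnapshotError(KeyError):
    """Raised when no snapshot was saved for a handle path."""


snapshots = {}  # handle path -> data; nothing is saved for empty snapshots


def load_snapshot(path):
    if path not in snapshots:
        raise NoSnapshotError(path)
    return snapshots[path]


def load_snapshot_or_empty(path):
    # Treat "no snapshot saved" as "empty snapshot" so the equality check
    # against a new event's (possibly empty) snapshot still works.
    try:
        return load_snapshot(path)
    except NoSnapshotError:
        return {}


# A queued event that saved no snapshot compares as an empty dict:
assert load_snapshot_or_empty('MyCharm/on/install[1]') == {}
```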


Comment on lines +497 to +555
# Emit an event, which will be deferred.
pub.d.emit('foo')
notices = tuple(framework._storage.notices())
assert len(notices) == 2 # One per observer.
assert framework._storage.load_snapshot(notices[0][0]) == {'data': 'foo'}

# Emit the same event, and we'll still just have the single notice.
pub.d.emit('foo')
assert len(tuple(framework._storage.notices())) == 2

# Emit the same event kind but with a different snapshot, and we'll get a new notice.
pub.d.emit('bar')
notices = tuple(framework._storage.notices())
assert len(notices) == 4
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}

# Emit a totally different event, and we'll get a new notice.
pub.n.emit()
notices = tuple(framework._storage.notices())
assert len(notices) == 6
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}
assert framework._storage.load_snapshot(notices[4][0]) == {}

# Even though these events are far back in the queue, since they're
# duplicates, they will get skipped.
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
assert len(tuple(framework._storage.notices())) == 6

def notices_for_observer(n: int):
    return [
        notice for notice in framework._storage.notices() if notice[1].endswith(f'[{n}]')
    ]

# Stop deferring on the first observer, and all those events will be
# completed and the notices removed, while the second observer will
# still have them queued.
pub.r.emit()
assert len(tuple(framework._storage.notices())) == 6
pub.n.emit()
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 3

# Without the defer active, the first observer always ends up with an
# empty queue, while the second observer's queue continues to skip
# duplicates and add new events.
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('baz')
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 4
Contributor:
Please consider refactoring the setup and splitting this flow into separate tests.

Successfully merging this pull request may close these issues.

Prevent the deferred events queue to contain duplicates