feat: don't store duplicate events in the notice queue #1372
base: main
Conversation
Although this seems like a fairly small change, and it's in one very specific piece of the code, it's at the heart of running every event, so very thorough review is appreciated. I'm also happy to discuss the performance implications as well as the implications of the behaviour change.

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.
```python
if (
    existing_observer_path != observer_path
    or existing_method_name != method_name
    or existing_event_path.split('[')[0] != event_path.split('[')[0]
```
Looking through `class Handle`, it seems that not all paths have a "key" index.
That's true, but all events (from `BoundEvent`'s `emit`) do have one, and that's the only type of object that has its path in the `notices` table.
If the `event_path` somehow was missing the key, then the code would also still work, because it's getting the 0th element so it will still be checking the full path - there just won't be a `[1:]` that gets discarded.
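As an illustration of that point (this is not the PR's code, and the example paths are made up), a minimal sketch of how the comparison behaves with and without a key suffix:

```python
def same_event_kind(existing_event_path: str, event_path: str) -> bool:
    # Event paths from BoundEvent.emit carry a per-emission key suffix,
    # e.g. "MyCharm/on/config_changed[17]". Taking split('[')[0] drops the
    # suffix; a path without any suffix is returned unchanged, so the
    # comparison degrades gracefully to comparing the full path.
    return existing_event_path.split('[')[0] == event_path.split('[')[0]

# Same event kind, different emission keys: treated as matching.
print(same_event_kind('MyCharm/on/config_changed[17]', 'MyCharm/on/config_changed[42]'))  # True
# A path with no key suffix still compares correctly.
print(same_event_kind('MyCharm/on/config_changed', 'MyCharm/on/config_changed[42]'))  # True
# Different event kinds never match.
print(same_event_kind('MyCharm/on/start[3]', 'MyCharm/on/config_changed[42]'))  # False
```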
This assumption/logic deserves a comment, or maybe a unit test.
I'd be lost if I tried to eyeball this without your comment.
Code like this doesn't work with Harness:

```python
class MyCharm(ops.CharmBase):
    ...
    def _on_config_changed(self, event):
        if self.something():
            event.defer()
        self.do_something()

def test_config_changed():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: True
    h.charm.on.config_changed.emit()
    MyCharm.something = lambda _: False
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()  # Assume this was patched somewhere
```

The framework skips the second emit. Arguably, the test should clean up in some way, because it's actually going to run the config-changed handler three times (once deferred, then twice when it doesn't defer: once for the deferred notice and once for the new one) - it could be broken into multiple tests, use a fresh Harness, or just call …

It seems likely (I'll do a quick check to be sure) that this pattern is too common to have the change in this way - I don't think we can expect people to update their tests to account for this, since it's not really a bug, just not the ideal way to write the test. Maybe the 'skip duplicate defers' will need to be explicitly enabled?
I did a super-tox (…)
Docs: Ops: Add top-level intro, clean up ops module intro in light of that, flesh out ops.pebble and ops.testing intros a bit more (https://warthogs.atlassian.net/browse/CHARMTECH-183)

- [x] "home page" library reference
- [x] ops module
- [x] ops.main entry point
- [x] ops.pebble module
- [x] ops.testing module
…names in ops.testing (canonical#1381)

Add a new optional install `testing`, e.g. `pip install ops[testing]`. This pulls in a compatible version of `ops-scenario`, and exposes the Scenario names in the `ops.testing` namespace, alongside Harness.

`pip install ops[harness]` is also supported to ease the (far off, presumably 3.0) transition to Harness being moved out of the base install. It currently installs no extra dependencies, so is the same as `pip install ops`, but a forward-looking charm would use `pip install ops[harness]` (or `pip install ops[harness, testing]`) if using Harness.

Requires ops-scenario 7.0.5, which has the required adjustments to support insertion into `ops.testing`.

The `ActionFailed` name exists in both `ops._private.harness` and `scenario.context`. This is handled by adjusting the Harness class to support the functionality of both and monkeypatching that into Scenario until Scenario starts using it. It's compatible with both Harness and Scenario, but will have empty data in an attribute (which attribute depends on which framework is used).

The `Container` name in `ops.testing`, which is only present for backwards compatibility, is also overwritten if ops-scenario is installed. If anyone is actually using `ops.testing.Container` instead of `ops.Container` then they'll need to fix their code before using `ops[testing]` (or ops-scenario combined with the release of ops with this change).

A very basic unit test is added to make sure that Scenario tests work (all the actual Scenario tests are in the ops-scenario repo) if ops-scenario/ops[testing] is installed (this is the case for `tox -e unit`). A test is also added to ensure that all of the Scenario names are documented, since `automodule` isn't used any more.

Also adjusts the documentation to include the new framework in ops.testing.

Co-authored-by: Ben Hoyt <[email protected]>
Co-authored-by: Dima Tisnek <[email protected]>
Force-pushed from e8d1ec8 to 1701f87 (compare)
Ok, this should be good for an initial review now.
Thanks, this looks good to me, and it's a nice one to get fixed. Could we do a couple more tests before merging, just in case:
- You've called out the performance changes, but for a sanity check, could you please do a couple of measurements? What's a typical time to load the notices given an empty queue? And then if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but it would be good to confirm.
- I think trying to run super-tox `-e integration` will be too slow / problematic, but could we at least please spot check `tox -e integration` on a couple of charms that use defer?
Sorry I may not have thought this through.
Is it possible to get 2 distinct events from Juju that only differ in arguments?
For example, 2 secrets are now available, the secret id is different, but everything else is the same (event name, observer function, human name if any).
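In that case the two events would share the observer, method, and event kind, so under this change the snapshot comparison is what keeps them distinct. A toy illustration (the snapshot key used here is invented for the example, not the real `SecretChangedEvent` snapshot format):

```python
# Hypothetical snapshots for two secret-changed events for different secrets.
snapshot_first = {'secret_id': 'secret:aaa'}
snapshot_second = {'secret_id': 'secret:bbb'}

# Observer, method, and event kind are identical for both events, so the
# snapshot equality check is the only thing that distinguishes them:
print(snapshot_first == snapshot_second)  # False -> stored as separate notices
```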
```python
except NoSnapshotError:
    existing_event_data = {}
```
This code branch feels wrong.
If an event cannot be loaded, should it not be discarded instead?
Or do we have saved snapshots with no data?
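To illustrate what the branch does, here is a self-contained sketch using stand-ins rather than the real ops classes (the class names mirror ops but the storage and snapshot key are invented): an event that saved no state has no snapshot row at all, so a missing snapshot is treated as empty data rather than a load failure.

```python
class NoSnapshotError(Exception):
    """Stand-in for ops' NoSnapshotError, raised when no snapshot row exists."""

class FakeStorage:
    # Minimal stand-in for the storage backend: only events that actually
    # saved state have a snapshot row.
    def __init__(self):
        self._snapshots = {'MyCharm/on/secret_changed[3]': {'secret_id': 'secret:aaa'}}

    def load_snapshot(self, handle_path):
        try:
            return self._snapshots[handle_path]
        except KeyError:
            raise NoSnapshotError(handle_path)

def load_or_empty(storage, handle_path):
    # The branch under review: a missing snapshot means the event saved no
    # state, so compare it as empty data instead of discarding the notice.
    try:
        return storage.load_snapshot(handle_path)
    except NoSnapshotError:
        return {}

storage = FakeStorage()
print(load_or_empty(storage, 'MyCharm/on/secret_changed[3]'))  # {'secret_id': 'secret:aaa'}
print(load_or_empty(storage, 'MyCharm/on/start[4]'))           # {}
```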
```python
# Emit an event, which will be deferred.
pub.d.emit('foo')
notices = tuple(framework._storage.notices())
assert len(notices) == 2  # One per observer.
assert framework._storage.load_snapshot(notices[0][0]) == {'data': 'foo'}

# Emit the same event, and we'll still just have the single notice.
pub.d.emit('foo')
assert len(tuple(framework._storage.notices())) == 2

# Emit the same event kind but with a different snapshot, and we'll get a new notice.
pub.d.emit('bar')
notices = tuple(framework._storage.notices())
assert len(notices) == 4
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}

# Emit a totally different event, and we'll get a new notice.
pub.n.emit()
notices = tuple(framework._storage.notices())
assert len(notices) == 6
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}
assert framework._storage.load_snapshot(notices[4][0]) == {}

# Even though these events are far back in the queue, since they're
# duplicates, they will get skipped.
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
assert len(tuple(framework._storage.notices())) == 6

def notices_for_observer(n: int):
    return [
        notice for notice in framework._storage.notices() if notice[1].endswith(f'[{n}]')
    ]

# Stop deferring on the first observer, and all those events will be
# completed and the notices removed, while the second observer will
# still have them queued.
pub.r.emit()
assert len(tuple(framework._storage.notices())) == 6
pub.n.emit()
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 3

# Without the defer active, the first observer always ends up with an
# empty queue, while the second observer's queue continues to skip
# duplicates and add new events.
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('baz')
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 4
```
Please consider refactoring the setup and splitting this flow into separate tests.
When a new event is emitted, check if there is already an exact duplicate (the same observer, same event type, and same event snapshot) in the storage from a deferral in a previous run. If a duplicate does exist, then don't store a new notice and snapshot.
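A rough sketch of that check, using stand-ins for the storage interface (a list of notices plus a snapshot dict; the real implementation lives in ops.framework and may differ in detail):

```python
from typing import Dict, List, Tuple

Notice = Tuple[str, str, str]  # (event_path, observer_path, method_name)

def already_queued(
    notices: List[Notice],
    snapshots: Dict[str, dict],
    event_path: str,
    observer_path: str,
    method_name: str,
    event_data: dict,
) -> bool:
    """Return True if an exact duplicate notice is already stored."""
    for existing_event_path, existing_observer_path, existing_method_name in notices:
        if (
            existing_observer_path != observer_path
            or existing_method_name != method_name
            or existing_event_path.split('[')[0] != event_path.split('[')[0]
        ):
            continue
        # Same observer, method, and event kind: only now pay the cost of
        # comparing the snapshot data (missing snapshot == empty data).
        if snapshots.get(existing_event_path, {}) == event_data:
            return True
    return False

notices = [('MyCharm/on/config_changed[1]', 'MyCharm[1]', '_on_config_changed')]
snapshots = {}
print(already_queued(notices, snapshots, 'MyCharm/on/config_changed[2]',
                     'MyCharm[1]', '_on_config_changed', {}))  # True -> skip storing
```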
There are performance implications:

- If there are many similar notices in the queue (for example, many deferred `secret-changed` events but for different secrets) then there is a cost to doing the comparison.

There is also a (deliberate) behaviour change that does impact event ordering. For example, consider this sequence of events:

1. `config-changed`, deferred
2. `secret-changed`, deferred
3. `config-changed`
Currently, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`, `ConfigChangedEvent`
With this change, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`
More generally, there could currently be duplicate (notice+snapshot) events intermixed throughout the queue, and each run they will be re-emitted in the order in which they originally occurred. With this change, they will still be re-emitted in the order in which they originally occurred, except that any that have already been emitted this run are skipped. The particularly noticeable change is that the Juju event that triggered the run may not be the last event (if it was a duplicate of one in the queue).
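The ordering point can be seen with a tiny simulation (plain Python, not ops code), replaying run 3 of the sequence above, where the queue already holds a deferred `config-changed` and `secret-changed`:

```python
def run(queue, incoming, skip_duplicates):
    """Simulate one dispatch: re-emit the deferred queue, then the incoming
    Juju event, skipping it if it duplicates a queued notice."""
    emitted = list(queue)
    if not (skip_duplicates and incoming in queue):
        emitted.append(incoming)
    return emitted

queue = ['config-changed', 'secret-changed']  # deferred from runs 1 and 2

# Current behaviour: the triggering event is always re-emitted last.
print(run(queue, 'config-changed', skip_duplicates=False))
# ['config-changed', 'secret-changed', 'config-changed']

# With this change: the duplicate is skipped, so the triggering Juju event
# is not the last event of the run.
print(run(queue, 'config-changed', skip_duplicates=True))
# ['config-changed', 'secret-changed']
```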
We could potentially do this differently - for example, when a duplicate occurs, moving the event to the end of the queue (by dropping and re-adding the notice+snapshot, or explicitly setting the `sequence` field for SQL, and by just reordering the list for Juju). This would add complexity and have a performance penalty, however, and it seems more correct to keep the original order.

For unit tests:
- With Harness, there is no `reemit()` call like there would be in production, and the test expects the handler to be called more than once. For this reason, the skipping is disabled for Harness.
- With Scenario, if you're `ctx.run`ing then you need to manage that in the list of deferred events passed into the State. We could add a consistency check to alert if there are duplicates in that list (this would be easier to do when the Scenario code is in this repo). However, the Scenario behaviour does still change: if the event is deferred in the `ctx.run` and there's already a match in the state's deferred list, then the new (deferred) event does not get added, and the output state doesn't change (which is what we want). We get this behaviour automatically because Scenario mimics the runtime behaviour more closely, actually running the framework emitting/reemitting. So: Scenario tests are both safer, and can be used to match the new behaviour.

This can be tested manually with a charm that optionally defers events, e.g. with the code below.
charm.py and charmcraft.yaml content
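The collapsed charm.py/charmcraft.yaml details from the original page aren't preserved here. As a rough sketch of the config side only (the option name `opt` comes from the description below; the type, default, and description text are invented):

```yaml
# Illustrative charmcraft.yaml fragment: an "opt" option the charm reads in
# its config-changed handler, deferring while the value starts with "defer".
config:
  options:
    opt:
      type: string
      default: "no-defer"
      description: >
        Set to a value starting with "defer" to make the charm defer
        config-changed (and secret-changed) events.
```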
If you want to see the queue while you're doing this, you can use code like this:
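The original snippet isn't preserved. As a sketch, assuming ops' local SQLite storage keeps the queue in a `notice` table inside the unit's `.unit-state.db` (check ops/storage.py for the authoritative schema), the example below builds a throwaway in-memory database with that shape rather than touching a real unit:

```python
import sqlite3

# Assumed schema of ops' local storage; on a real unit you would connect to
# the charm's .unit-state.db instead of ':memory:'.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE notice ('
    'sequence INTEGER PRIMARY KEY AUTOINCREMENT,'
    'event_path TEXT, observer_path TEXT, method_name TEXT)'
)
conn.execute(
    "INSERT INTO notice (event_path, observer_path, method_name) "
    "VALUES ('MyCharm/on/config_changed[1]', 'MyCharm[1]', '_on_config_changed')"
)

# Dump the queued notices in order - this is the query you'd run against the
# real database to watch the queue grow and shrink.
rows = list(conn.execute(
    'SELECT sequence, event_path, observer_path, method_name '
    'FROM notice ORDER BY sequence'
))
for row in rows:
    print(row)
```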
If you set `opt` to anything not starting with "defer" then you should get a `config-changed` event every time. If you set it to something starting with "defer", then it will run exactly once each time you set the config (remember to change the value, or Juju will skip the event) - with ops@main you'll instead get a `config-changed` event for every time that you have changed the config, every time you change it (i.e. the queue will build up).
event for every time that you change the config, every time you change it (ie. the queue will build up).You can also check it with an event that has a snapshot, such as
secret-changed
. If the config is set to defer, every time you change the content of the first secret, you'll get onesecret-changed
event (but with ops@main, each time you'll get multiple, depending on how many times you've done it). If you also change the second secret, you'll get twosecret-changed
events, one for each secret (because the snapshots differ).You can intermix the different events and should always have exactly zero or one (event type+snapshot) in the queue. If you change the
opt
value back to something not starting with "defer", then you should see all the events complete and have an empty queue.Scenario test that shows the behaviour
Note that this requires fixing a small bug in Secret snapshotting (PR). That's unrelated to this change - it also occurs if you use deferred secrets in main.
Fixes #935