Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(utd_hook): Fix regression causing retry to report false late decrypt #4252

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

BillCarsonFr
Copy link
Member

@BillCarsonFr BillCarsonFr commented Nov 12, 2024

There has been a recent change on Decryptor#decrypt_event_impl causing the function to return an TimelineEvent of kind unable to decrypt instead of failing with an error.

The late_decrypt detection code was not changed, causing any retry to mark UTDs as late decrypt.

  • Public API changes documented in changelogs (optional)

Signed-off-by:

Copy link

codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.92%. Comparing base (9dd2d5e) to head (24ef084).
Report is 37 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4252      +/-   ##
==========================================
- Coverage   84.95%   84.92%   -0.03%     
==========================================
  Files         274      274              
  Lines       29768    29769       +1     
==========================================
- Hits        25288    25282       -6     
- Misses       4480     4487       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@BillCarsonFr BillCarsonFr marked this pull request as ready for review November 12, 2024 17:21
@BillCarsonFr BillCarsonFr requested a review from a team as a code owner November 12, 2024 17:21
@richvdh
Copy link
Member

richvdh commented Nov 12, 2024

According to @BillCarsonFr: this bug should have caused the UTD rate reported to Posthog to (incorrectly) drop significantly. We haven't actually observed that -- possibly because the relevant change f4e8222 (#4070) hasn't rolled out very far yet?

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks correct to me; sorry for missing this before.

It would be good if we had a regression test.

None
} else {
trace!(
kind = ?event.kind,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably a bit verbose?

Copy link
Member Author

@BillCarsonFr BillCarsonFr Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not changed by this PR. Any other reason to remove it? is it spamming the RS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contains most of the event's data that's not privacy-sensitive, so it will be spammy. This line can be reached multiple times for each undecrypted event that failed to re-decrypt again, so it will be super spammy, please remove it.

@BillCarsonFr
Copy link
Member Author

BillCarsonFr commented Nov 13, 2024

According to @BillCarsonFr: this bug should have caused the UTD rate reported to Posthog to (incorrectly) drop significantly. We haven't actually observed that -- possibly because the relevant change f4e8222 (#4070) hasn't rolled out very far yet?

Actually there is no grace period on rust sdk UTD reporting like on other platform; On other platforms, an event decrypted in less than 4s is graced out and not reported. It is not the case in rust sdk, so the UTD would be reported immediatly but as a temporary UTD (with timeToDecryptMillis set to a value).
So we shouldn't see a drop in posthog. What could be expected is a raise of temporary UTD.

@BillCarsonFr
Copy link
Member Author

This change looks correct to me; sorry for missing this before.

It would be good if we had a regression test.

Added a regression test

// Notify observers that we managed to eventually decrypt an event.
if let Some(hook) = unable_to_decrypt_hook {
hook.on_late_decrypt(&remote_event.event_id, *utd_cause).await;
if let matrix_sdk::deserialized_responses::TimelineEventKind::UnableToDecrypt { utd_info, ..} = event.kind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use use, not the fully qualified name here

None
} else {
trace!(
kind = ?event.kind,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contains most of the event's data that's not privacy-sensitive, so it will be spammy. This line can be reached multiple times for each undecrypted event that failed to re-decrypt again, so it will be super spammy, please remove it.

{
let utds = hook.utds.lock().unwrap();
assert_eq!(utds.len(), 1);
// Without the regression fix this would be Some(time to decrypt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Without the regression fix this would be Some(time to decrypt)
// This is the main thing we're testing: if this wasn't identified as a definite UTD, this would be `Some(..)`.

Comment on lines +305 to +313
match olm_machine
.try_decrypt_room_event(raw.cast_ref(), room_id, &decryption_settings)
.await?
{
RoomEventDecryptionResult::Decrypted(decrypted) => Ok(decrypted.into()),
RoomEventDecryptionResult::UnableToDecrypt(utd_info) => {
Ok(TimelineEvent::new_utd_event(raw.clone(), utd_info))
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed, out of curiosity? This is for the test timeline; is it operating that differently from the non-test timeline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed to make the regression test ~"fails correctly".
If not the test never detects the behavior change because it is using the mock Decryptor that continues to fail with an error when there is an UTD instead of return a OK UTDEvent type.

The underlying problem is that we use a big catch everything Error for all possible errors of the SDK. And here the change that created the regression should have resulted in a change of signature of the trait. It is not anymore Result<, ALLERRORS> but a Result <,SubsetOfErrors>?

@bnjbvr
Copy link
Member

bnjbvr commented Nov 18, 2024

(Great catch and thanks for fixing, btw)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants