Coverage alignment heuristic #76

Merged
merged 10 commits into from
Aug 22, 2024

Conversation

SkBlaz
Collaborator

@SkBlaz SkBlaz commented Aug 16, 2024

Adds a simple heuristic for computing the proportion of aligned values between two arrays. This makes pairwise comparisons a breeze.

https://jira.outbrain.com/browse/REF-51623
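For readers without access to the diff, here is a minimal sketch of one plausible reading of "proportion of aligned values", inferred only from the function name `max_pair_coverage`: the share of rows taken up by the single most frequent `(a, b)` pair. The name `exact_pair_coverage` is hypothetical and this exact, dictionary-based version is a reference point, not the code merged in this PR.

```python
from collections import Counter
from typing import Sequence


def exact_pair_coverage(column_a: Sequence[int], column_b: Sequence[int]) -> float:
    """Fraction of rows occupied by the most frequent (a, b) pair.

    Hypothetical reference implementation; the PR's own function may differ.
    """
    if len(column_a) != len(column_b):
        raise ValueError("columns must have the same length")
    if len(column_a) == 0:
        return 0.0
    # Count every row-wise pair exactly, without hashing tricks.
    pair_counts = Counter(zip(column_a, column_b))
    most_common_count = pair_counts.most_common(1)[0][1]
    return most_common_count / len(column_a)


# The pair (1, 5) appears in 2 of 4 rows, so the coverage is 0.5.
print(exact_pair_coverage([1, 1, 2, 2], [5, 5, 5, 6]))
```

A value near 1.0 would mean one pair dominates both columns; a value near 0 would mean the pairs are spread out.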

@SkBlaz SkBlaz requested a review from adischw August 16, 2024 19:11

def max_pair_coverage(array1: np.array, array2: np.array) -> float:
    def hash_pair(el1, el2):
        return el1 * 17 - el2
Collaborator Author

This is around 100% faster than hash(el1) someOp hash(el2), and it is fine for this column-wise use case.
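To illustrate the trade-off in that comment, here is a hedged sketch of how an arithmetic pair hash could feed a bucketed count. Only `hash_pair` and the `10**6` bound come from the PR; the function name `hashed_pair_coverage`, the widening to `int64`, and the use of `np.bincount` are assumptions made for this example, since the rest of the merged function is not shown here.

```python
import numpy as np
import numpy.typing as npt

MAX_SIZE = 10**6  # bucket count, mirroring max_size in the PR


def hash_pair(el1, el2):
    # Arithmetic mix quoted in the PR; cheap, but not collision-free.
    return el1 * 17 - el2


def hashed_pair_coverage(
    array1: npt.NDArray[np.int32], array2: npt.NDArray[np.int32]
) -> float:
    """Approximate share of rows taken by the most frequent pair (sketch)."""
    if array1.shape != array2.shape:
        raise ValueError("arrays must have the same shape")
    if array1.size == 0:
        return 0.0
    # Assumption: widen to int64 so that el1 * 17 cannot overflow int32.
    keys = hash_pair(array1.astype(np.int64), array2.astype(np.int64)) % MAX_SIZE
    # The modulo with a positive divisor keeps keys non-negative for bincount.
    counts = np.bincount(keys)
    return float(counts.max()) / array1.size


a = np.array([1, 1, 2, 2], dtype=np.int32)
b = np.array([5, 5, 5, 6], dtype=np.int32)
print(hashed_pair_coverage(a, b))  # (1, 5) fills 2 of 4 rows
```

Because distinct pairs can share a key, for example `hash_pair(0, 0)` and `hash_pair(1, 17)` are both 0, a sketch like this can overestimate coverage, which is why it is a heuristic and not an exact count.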



def max_pair_coverage(array1: npt.NDArray[np.int32], array2: npt.NDArray[np.int32]) -> float:
    def hash_pair(el1: np.int32, el2: np.int32):
Collaborator

Missing output type hint :)

Collaborator Author

Will add it together with improved pre-commit hooks, as those should have caught it.

import numpy.typing as npt

np.random.seed(123)
max_size = 10**6
Collaborator

Quick question: can max_size be estimated from the input vector?

Collaborator Author

It can, but this is directly tied to batch sizes, for which 1 million is a very safe bound (many other things go wrong before it is reached).

@SkBlaz SkBlaz merged commit bed2095 into main Aug 22, 2024
12 checks passed
3 participants