Sort similarity scores #335
Comments
I’ve disabled two tests in |
I think this may have been incorrectly closed by a comment in #339. The entity service doesn't do a merge sort on similarity scores; however, the solving occurs in a single Celery task (on a high-memory machine), and the anonlink library sorts before solving. @nbgl, is there any reason to still want the entity service to merge sort (other than if/when we have to support a parallel solver)? See:
The greedy solver needs to consider scores from highest to lowest to maximise accuracy. Sorting before solving is the usual way of doing this (I describe a more efficient way in data61/anonlink#212, but it is more complex to implement). The Entity Service does actually sort similarity scores now, so this issue may be closed. The code is here as part of #339. One concern is that this aggregation is single-threaded, so it might not be the most efficient approach. Parallelising this merge sort is not a research question, but a software engineering one. (Not hard, but annoying.) But it should be identified as a bottleneck before any work is done, and it should be a separate issue.
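To illustrate why the order matters, here is a minimal sketch of a greedy matcher fed by a merge of pre-sorted chunks. It is not the anonlink or Entity Service code: the names `merge_sorted_chunks` and `greedy_solve`, the `(score, index_a, index_b)` tuple layout, and the sample scores are all assumptions made for this example. The merge uses Python's standard `heapq.merge`, which requires each input chunk to already be sorted from highest to lowest score.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical layout: (similarity score, record index in A, record index in B).
Candidate = Tuple[float, int, int]


def merge_sorted_chunks(chunks: List[List[Candidate]]) -> Iterable[Candidate]:
    """Lazily merge chunks that are each sorted by descending score."""
    return heapq.merge(*chunks, key=lambda c: c[0], reverse=True)


def greedy_solve(candidates: Iterable[Candidate]) -> List[Tuple[int, int]]:
    """Accept each pair in the order given, unless a record is already matched."""
    matched_a, matched_b = set(), set()
    pairs = []
    for _score, a, b in candidates:
        if a not in matched_a and b not in matched_b:
            matched_a.add(a)
            matched_b.add(b)
            pairs.append((a, b))
    return pairs


# Two chunks, each already sorted from highest to lowest score.
chunk_1 = [(0.95, 0, 0), (0.60, 1, 0)]
chunk_2 = [(0.90, 0, 1), (0.80, 1, 1)]

sorted_result = greedy_solve(merge_sorted_chunks([chunk_1, chunk_2]))

# The same candidates in an arbitrary order give a worse assignment.
unsorted_candidates = [(0.60, 1, 0), (0.95, 0, 0), (0.90, 0, 1), (0.80, 1, 1)]
unsorted_result = greedy_solve(unsorted_candidates)

print(sorted_result)
print(unsorted_result)
```

With sorted input the matcher keeps the pairs scored 0.95 and 0.80; with the arbitrary order it locks in the 0.60 pair first and ends up with 0.60 and 0.90, which is why the scores have to be sorted (or merged from sorted chunks) before solving.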
Thanks for the clarification Jakub |
Similarity scores need to be sorted before solving. There are some helper tools in anonlink.concurrency for this.