Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Anonlink binary format for serialisation #339

Merged
merged 2 commits into from
Mar 15, 2019
Merged

Conversation

nbgl
Copy link
Contributor

@nbgl nbgl commented Mar 13, 2019

NB: There is a bug in anonlink that causes the tests to fail. This is fixed in data61/anonlink#190, which must be followed by a release and a version bump in the requirements file.

@nbgl nbgl self-assigned this Mar 13, 2019
@nbgl nbgl requested a review from hardbyte March 13, 2019 04:13
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to ship once the deletion is dealt with. A few other comments inline.

csvwriter.writerows(partial_sparse_result)
bytes_iter, file_size \
= anonlink.serialization.dump_candidate_pairs_iter(chunk_results)
iter_stream = iterable_to_stream(bytes_iter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - I like the removal of a write to local disk

generate_code(12))
merged_file_stream = iterable_to_stream(merged_file_iter)
try:
mc.put_object(Config.MINIO_BUCKET, merged_file_name,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they partial merge files deleted after merge is complete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They’re deleted on line 257 :)

file0 = heapq.heappop(files)
file1 = heapq.heappop(files)
merged_file = _merge_files(mc, log, file0, file1)
heapq.heappush(files, merged_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps delete file0 and file1 at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are deleted on line 257.

assert dset_i1 == 1
mapping[str(rec_i0)] = str(rec_i1) # Strings make JSON happy.
mapping = {str(i): str(j)
for i, j in anonlink.solving.pairs_from_groups(groups)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can get rid of the cast to str - are the candidate_pairs loaded via anonlink.serialization.load_candidate_pairs(score_file) strings already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are integers when they are loaded by load_candidate_strings. greedy_solve requires an array of integers and it will not accept strings.

backend/entityservice/utils.py Show resolved Hide resolved
# Process the CSV into JSON
csv_text_stream = io.TextIOWrapper(csv_data_stream, encoding="utf-8")
sims_data_stream = mc.get_object(config.MINIO_BUCKET, filename)
# TODO: Below is an Anonlink 'private' API. It should be made
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please create an issue in anonlink for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nbgl
Copy link
Contributor Author

nbgl commented Mar 13, 2019

The file deletion is already there. Pls approve.

@nbgl nbgl requested a review from hardbyte March 13, 2019 05:37
@nbgl nbgl merged commit ffb2c7a into develop Mar 15, 2019
@nbgl nbgl deleted the feature-serialization branch March 15, 2019 11:36
@hardbyte hardbyte mentioned this pull request Jun 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sort similarity scores
2 participants