
Handle blocking input that doesn't cover all encodings #627

Open
hardbyte opened this issue Mar 7, 2021 · 1 comment

Comments

@hardbyte
Collaborator

hardbyte commented Mar 7, 2021

blocklib filters out records according to the blocking specification. It warns if not all records are included in a block after applying a particular blocking schema. Since the blocking schema may have been produced by someone else (another party), it seems reasonable that the Anonlink service should accept the blocking data as given, even if it doesn't cover 100% of the records. Possibly the service should warn that not all records are covered, or do something else (put the strays into their own block?).

An alternative is that clients could filter out the records that are not part of any block and skip uploading the CLK encodings for those records.
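A minimal sketch of that client-side filtering idea (the data layout and function names here are illustrative assumptions, not blocklib or anonlink-client API). Note it has to produce an index mapping, because dropping records compacts the upload indices:

```python
# Sketch: drop encodings that appear in no block before upload, keeping a
# map from the compacted upload indices back to the original PII rows.
# All names here are hypothetical, for illustration only.

def filter_uncovered(encodings, blocks):
    """encodings: list of CLKs; blocks: dict block_id -> list of record indices."""
    covered = set()
    for indices in blocks.values():
        covered.update(indices)

    old_to_new = {}          # original PII index -> uploaded index
    filtered = []
    for old_idx in sorted(covered):
        old_to_new[old_idx] = len(filtered)
        filtered.append(encodings[old_idx])

    # Remap block contents into the new, compacted index space.
    remapped = {bid: [old_to_new[i] for i in idxs]
                for bid, idxs in blocks.items()}
    return filtered, remapped, old_to_new
```

For example, with 4 encodings and blocks covering only records 0, 2 and 3, record 1 is dropped and every later index shifts down by one.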

As an example here is a (terrible) blocking schema for the febrl 4 dataset that excludes a few records:

blocking_schema = {
    "type": "p-sig",
    "version": 1,
    "config": {
        "blocking-features": ['given_name', 'surname'],
        "filter": {
            "type": "ratio",
            "max": 0.1,
            "min": 0.01,
        },
        "blocking-filter": {
            "type": "bloom filter",
            "number-hash-functions": 10,
            "bf-len": 2048,
        },
        "signatureSpecs": [
            [
                {"type": "characters-at", "feature": "given_name", "config": {"pos": [0]}},
            ],
            [
                {"type": "characters-at", "feature": "surname", "config": {"pos": [0]}},
            ]
        ]
    }
}

Blocklib notes that this could be an issue:

P-Sig: Warning! only 96.42% records are covered in blocks. Please consider to improve signatures
Statistics for the generated blocks:
	Number of Blocks:   37
	Minimum Block Size: 60
	Maximum Block Size: 475
	Average Block Size: 217.40540540540542
	Median Block Size:  207
	Standard Deviation of Block Size:  123.52293072306216
P-Sig: Warning! only 97.1% records are covered in blocks. Please consider to improve signatures
Statistics for the generated blocks:
	Number of Blocks:   39
	Minimum Block Size: 52
	Maximum Block Size: 456
	Average Block Size: 210.17948717948718
	Median Block Size:  193
	Standard Deviation of Block Size:  113.03838933250947

The Anonlink service then fails while importing the encodings:

2021-03-08 11:44:31 | File "/var/www/entityservice/tasks/encoding_uploading.py", line 77, in pull_external_data
-- | --
2021-03-08 11:44:31 | assert count == len(encoding_to_block_map), f"Expected {count} encodings in blocks got {len(encoding_to_block_map)}"
2021-03-08 11:44:31 | AssertionError: Expected 5000 encodings in blocks got 4982
2021-03-08 11:44:31 | [2021-03-07 22:44:31,870: ERROR/ForkPoolWorker-2] Task entityservice.tasks.encoding_uploading.pull_external_data[005bf363-1176-40cd-a5c6-3c9f27f18bb0] raised unexpected: AssertionError('Expected 5000 encodings in blocks got 4982')
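One way the server-side check could be softened from a hard assertion to a warning, per the suggestion below. This is only a sketch of the idea, not the actual `entityservice` code; the function name and signature are hypothetical:

```python
# Sketch: tolerate encodings that are absent from the block map, but
# still reject the impossible case of more mapped encodings than were
# uploaded. Hypothetical helper, not the real pull_external_data logic.
import logging

logger = logging.getLogger(__name__)

def check_block_coverage(count, encoding_to_block_map):
    """Warn, rather than fail, when not every encoding landed in a block."""
    mapped = len(encoding_to_block_map)
    if mapped < count:
        logger.warning(
            "Only %d of %d encodings are covered by blocks; "
            "uncovered encodings will not be compared.", mapped, count)
    elif mapped > count:
        # More mapped encodings than uploaded ones indicates corrupt input.
        raise ValueError(
            f"Expected at most {count} encodings in blocks, got {mapped}")
```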
@wilko77
Collaborator

wilko77 commented Mar 10, 2021

I agree. It is not up to the entity service to decide whether this is a good thing to do or not.
As the data providers have to agree on a blocking scheme beforehand, and they get the coverage information from blocklib, they should decide together whether they want to proceed.

I am not a big fan of the filtering idea, as it destroys the alignment between the indices of the CLKs and the corresponding PII. You would then have to keep a mapping from the CLK indices, as returned by the server, to the local PII indices.

Thus, I vote for taming the server: execute the run irrespective of the coverage of the blocks, and maybe provide a warning to the analyst.
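The bookkeeping described above can be sketched briefly (all names hypothetical, for illustration): if uncovered records were filtered out client-side, match results returned in the server's index space have to be translated back to local PII rows.

```python
# Sketch: translate match pairs from uploaded-index space back to the
# original local PII indices, using the mapping kept at filter time.

def translate_matches(matches, uploaded_to_local):
    """matches: pairs of uploaded indices; returns pairs of local PII indices."""
    return [(uploaded_to_local[a], uploaded_to_local[b]) for a, b in matches]

# e.g. if uploaded index 1 was originally PII row 2, and 2 was row 3:
# translate_matches([(0, 2)], {0: 0, 1: 2, 2: 3}) -> [(0, 3)]
```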
