Attack on Bloom filters built from multiple identifiers #40

wilko77 · 2017-09-26T06:22:17Z

In Who Is 1011011111…
1110110010? Automated Cryptanalysis of Bloom Filter Encryptions of Databases with Several Personal Identifiers Kroll and Steinmetzer present cryptanalysis and an attack on Bloom filters built from multiple identifiers.

Is that applicable to our use-case?
Discuss!

hardbyte · 2017-09-26T06:59:32Z

My first (brief) read through indicates that yes it is applicable. It looks like the slightly dubious double hash is actually introducing the weakness that they are able to exploit.

We need to look into this further. It referenced Cryptanalysis of Basic Bloom Filters Used for
Privacy Preserving Record Linkage and mentioned it might propose a solution.

hardbyte · 2017-10-04T23:31:44Z

Some good news, one of the proposed solutions is to use multiple identifiers in the same bloom filter (which we already do):

Regarding the double hash the same paper strongly suggests using k independent hash functions:

If we use a single strong hash with a counter up to k used as salt would that meet that criteria?
I'm thinking of just using sha256 instead of the linear combination of sha1 and md5 (still HMAC'd). The code is now in clkhash

Thoughts? @wilko77 @mhsiah @gusmith @sjhardy

hardbyte · 2017-10-10T00:27:48Z

Slightly different version of "Cryptanalysis of basic Bloom filters used for Privacy
Preserving Record Linkage" here

Question still stands.

hardbyte · 2017-10-14T06:08:06Z

Warning I feel like I'm making up crypto here... would like help.

The problem is to hash a token m into k integer locations (% l). We also have a shared key to use.
Here is my current proposal:

    for m in mlist:
        hm = hmac.new(key.encode(), digestmod=sha256)
        hm.update(m.encode())
        for i in range(k):
            hm.update(i.to_bytes(8, 'big'))
            mi_index = int.from_bytes(hm.digest(), 'big') % l
            bf[mi_index] = 1

gusmith · 2017-10-16T05:47:16Z

Notice: this is just thoughts, without any proofs.

We need to be sure that the used hash is at least known to be secure against known-plaintext attack.
In fact, with the current version, we will hash the following (represented in bytes):

${secret}0
${secret}01
${secret}012
...

So the attacker gets k ciphertexts of plaintexts from which he knows the suffix, and he knows that the prefix is always the same.
If the hashing method is secure against chosen-plaintext attack, I think we wouldn't have any problem.

unzvfu · 2017-10-17T01:05:45Z

I need to brush up on this stuff, but one potentially useful observation occurs to me: A hash h := SHA256(m) is 256 bits long, so for a filter of length L, perhaps we can use the first lg(L)~10 bits as the first hash function, the second lg(L) bits for the second hash function, and so on. I'm pretty sure substrings of hashes should be (linearly) independent of one another. This would allow us to get ~25 independent hash functions for the price of one actual call to SHA256.

unzvfu · 2017-10-17T02:27:19Z

A couple of random posts on StackOverflow suggest that using different seeds results in independent hash functions (i.e. initialise your hash function with each seed, then hash the bigram as usual). The two parties in the protocol will have to agree on the sequence (set + ordering) of seeds. It's not clear whether the seeds need to be secret, or whether they can be chosen to be a fixed set of values (e.g. 0, ..., k-1).

wilko77 · 2017-10-18T02:18:49Z

There is also the idea of using XOR-folding

wilko77 · 2017-10-20T02:41:07Z

Weaknesses in our approach that can be exploited (as described in those attack papers):

When creating the bi-grams, the first and last bi-gram are padded with a whitespace. This is a weakness, because it allows an attacker more easily to find the beginning and the end of a word.
=> We should investigate if dropping the padding decreases matching accuracy.
The double hashing scheme. One consequence of choosing this scheme is that all bits belonging to a particular n-gram will follow a sequence b_{n+1} = b_{n} + hash_2 mod l. This structure can be exploited when looking for "atoms" (the bloom filter generated by one n-gram).
If we use k independent hash functions instead, then there will be no particular structure in an atom.
Also, there is a flaw in the double hashing scheme: if hash_2 mod l = 0, then only one bit will be set in the bloom filter irrespective of the value k (the probability of this happening is 1/l = 1/1024 !!!).

hardbyte · 2018-01-09T03:18:46Z

I think we might be able to wrap this issue up. Opening follow up issues to track any loose threads.

I've created issue Investigate effect of not including start/end bigrams #54 to look into weakness with hashing whitespare at the start/end of features.
Issue Flaw in double hash encoding of n-grams clkhash#33 looks into the double hashing flaw
clkhash has been modified to use KDF to ensure independant hashes are used.
Issue Consider supporting XOR-Folding clkhash#32 discussing looking into xor-folding
I'll delete my experimental branch with an alternative to the double hash scheme

@wilko77 can you look over the above and make sure I've captured everything before closing?

wilko77 · 2018-01-30T05:11:21Z

There are two separate issues with independence:

We reused the same keys for every identifier. This means, that an n-gram, let's say bo, in the identifier first name gets mapped to the same bit position in the Bloom filter as the n-gram bo in last name. Remedy here is to generate different keys for each identifier. This has been addressed in Feature different keys for each identifier clkhash#23.
As pointed out in the 'Who is 1011...' paper, the double hash encoding scheme of Schnell et. al. can be exploited to mount an attack. There is a proposal to address this in Feature blake encoding clkhash#40.

hardbyte added security state: Follow Up labels Sep 26, 2017

hardbyte added the P0: critical label Oct 11, 2017

hardbyte assigned wilko77 Jan 9, 2018

hardbyte added this to the Sprint 2018-01-29 milestone Jan 24, 2018

wilko77 closed this as completed Jan 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attack on Bloom filters built from multiple identifiers #40

Attack on Bloom filters built from multiple identifiers #40

wilko77 commented Sep 26, 2017

hardbyte commented Sep 26, 2017 •

edited

Loading

hardbyte commented Oct 4, 2017 •

edited

Loading

hardbyte commented Oct 10, 2017 •

edited

Loading

hardbyte commented Oct 14, 2017 •

edited

Loading

gusmith commented Oct 16, 2017

unzvfu commented Oct 17, 2017

unzvfu commented Oct 17, 2017 •

edited

Loading

wilko77 commented Oct 18, 2017

wilko77 commented Oct 20, 2017

hardbyte commented Jan 9, 2018

wilko77 commented Jan 30, 2018

Attack on Bloom filters built from multiple identifiers #40

Attack on Bloom filters built from multiple identifiers #40

Comments

wilko77 commented Sep 26, 2017

hardbyte commented Sep 26, 2017 • edited Loading

hardbyte commented Oct 4, 2017 • edited Loading

hardbyte commented Oct 10, 2017 • edited Loading

hardbyte commented Oct 14, 2017 • edited Loading

gusmith commented Oct 16, 2017

unzvfu commented Oct 17, 2017

unzvfu commented Oct 17, 2017 • edited Loading

wilko77 commented Oct 18, 2017

wilko77 commented Oct 20, 2017

hardbyte commented Jan 9, 2018

wilko77 commented Jan 30, 2018

hardbyte commented Sep 26, 2017 •

edited

Loading

hardbyte commented Oct 4, 2017 •

edited

Loading

hardbyte commented Oct 10, 2017 •

edited

Loading

hardbyte commented Oct 14, 2017 •

edited

Loading

unzvfu commented Oct 17, 2017 •

edited

Loading