Blank node labels may leak information #60
That's a strong assumption. What if, instead, for example, there is an equal likelihood that any of the candidate values satisfies the inequality?
That may be true but may also not have any practical significance. Perhaps a better example would have used a boolean value. There was another important assumption in the example, which was that there could not have been any other quads or blank nodes in the dataset. If there's actually a need to create a mitigation here, one avenue is the introduction of other (potentially randomized) quads.
With respect to using LDP BBS+ with VCs (perhaps its most common -- or only? -- use), at least one other blank node (from the VC itself) would be involved. Some analysis should be performed to understand how this affects the (potential) issue.
Yes, you are right. I do think that it depends on the example, though. For the example I presented, I have checked that “Kaiya” is indeed the only name (from a list of ~19000 English names) for which this inequality holds. For other examples, it could be anything (like total_names/2 or 20) depending on the credential and the disclosed claims.
If I understand your point correctly, it is true that other quads or blank nodes could affect the results. However, I don’t think that they necessarily prevent a leak (which also depends on the disclosed values). For example, the Holder could have the following credential (with multiple quads per blank node),
Here there are 2 blank nodes, each corresponding to one of the Holder's children (the credential and the credentialSubject have an ID, meaning they would not appear as blank nodes if I’m not mistaken). If the Holder wants to disclose the name and age of only their oldest child (“Ioanna”), the same method could be used to leak information about the un-disclosed claims (the name and age of their youngest child, “Kale”). If the adversary knows that the Holder has two kids, they could recover the hidden claims completely (at least according to my testing).
If in the end a mitigation is needed, I do think randomized quads could work. Maybe they could be “added” to each blank node during issuance? Would the Holder receive them as part of the JSON-LD?
In my initial example, one of the blank nodes is indeed the credential itself (which gets the label _:c14n1 in the end). I think that again it depends on the use case.
Right, and as mentioned above, I'd think there could be examples where the possible values were from a much smaller set (e.g., a boolean or a simple enum) where a hash reordering could negatively impact the privacy of more people.
My understanding is that for BBS+ credentials, both the
If I'm correct about the
Sorry, when I originally read this issue I could not follow your example; I understand now that this is due to the blank node allocation algorithm. Further analysis should be performed on this scenario, as on first glance it appears to depend on quite a few conditions being true:
That's correct.
This would appear to mitigate the risk of using the blank node allocation algorithm to deduce information about un-revealed claims.
@tplooker I may be terribly mistaken, but I do believe that even in cases where those 2 assumptions do not hold, there could be a potential problem. The second example I previously listed perhaps showcases this.
Here, even though the Holder keeps secret both the
Yes, that is very true. It could be just a matter of how often those assumptions hold. However, I do believe that this kind of information would not be hard to obtain, at least in some cases. This could be the result either of the
@dlongley thank you, I do believe I have a better understanding of your proposal now. The only occasion I can think of where additional “randomness” may be needed is in the (perhaps rare) cases where there are a lot of nested values forming a larger “blank nodes path”. There, the random quad added to the initial blank node (the one corresponding to the
The reference set of the blank node corresponding to the nested object
@BasileiosKal @dlongley I've implemented the proposed solution as discussed above over here. Basically, during signing/verifying and deriving of a proof, the nonce from the proof block gets injected into the root of the document as a source of randomness, which affects the blank node allocation. However, the nonce is never revealed to the verifier when a derived proof is generated.
@dlongley perhaps you can comment more here? From a quick trial on the JSON-LD playground it did seem to affect the allocation.
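For illustration, a minimal sketch of that kind of nonce injection, assuming the nonce is simply copied into a property on the JSON-LD document root before canonicalization; the property name, types, and function are hypothetical and not the terms used in the linked implementation:

```ts
// Hypothetical sketch: make the proof nonce part of the canonicalized input so
// it influences blank node allocation, while the nonce itself is withheld from
// the verifier when the derived proof is generated.
type JsonLdDocument = Record<string, unknown>;

function injectNonce(document: JsonLdDocument, proofNonce: string): JsonLdDocument {
  return {
    ...document,
    // Illustrative term; a real implementation would use a term defined in an
    // appropriate JSON-LD context.
    nonce: proofNonce,
  };
}

// Usage: canonicalize(injectNonce(credential, proof.nonce)) instead of
// canonicalize(credential).
```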
I suggest this go in the privacy considerations section, and that we not add a pre-processing requirement, since it increases complexity.
This may make sense if rdf-canon ends up not allowing for the needed flexibility; however, IMO this is not suited for the considerations section. Privacy/security considerations are for things that are easily avoidable, not required, or that an implementer should look out for so they will not unintentionally compromise security. This issue does not seem to fall into any of these categories. It's fine to choose not to address the issue, but it should be very clear from the beginning of the document what properties the algorithm has. For example, claiming zk-selective-disclosure would be inappropriate.
Yes, I am suggesting that be stated in the security considerations. To be clear, this is my proposal:
3.1 PII in the subject, predicate, object or named graph component of the quad. We should make it clear that using BBS on data that inherently leaks relationships has security and privacy implications, and that these are unavoidable when N-Quads are used directly. After this is done, we should consider whether there is a better solution if there is time remaining in the WG (which seems unlikely, given current contribution to the document).
Let's be honest about the default behavior, not hold up the work waiting for solutions that might never come.
There are already solutions for the selective disclosure case that should be applied in the spec. The issuer should generate an HMAC key, and HMAC all blank node identifiers. This produces stronger privacy for selective disclosure than any other method that relies on structural information for disclosure (such as array index positions). I don't think we need to wait on this -- it just needs to be implemented.
I've commented on the unlinkability vs. data leakage trade-off here, if we would like to mine it for the privacy considerations section: w3c/rdf-canon#84 (comment)
I'm currently a -1 to doing this, but it is possible that if I could review an implementation or description of this, I would change my mind... It seems like "wishful thinking" in a similar category to "linked secrets"; I prefer we cover the basics before discussing this.
100% agree with this, but I am not planning to implement it if I cannot read it in the current spec. I question adding something to the spec that does not have multiple independent implementations... or at least 1 pretty strong one... where we believe there will quickly follow a second. @dlongley are you saying we should only define the hmac transformed version in the current spec? What is the timeline for documenting that approach, and is there an implementation of it that I can review?
For me it's +1. The solution is as simple as using HMAC (or any keyed hash) in place of sha256 in URDNA2015. We can then describe the linkability problems in privacy considerations. I do agree that we should have a concrete proposal before resolving this, though. I would be happy to help document the solution together with the proper considerations. As far as timelines go, I can try finding some time next week.
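As a rough sketch of what that substitution looks like (assuming a canonicalizer with a pluggable quad-hashing function, which is an assumption about the implementation rather than a documented URDNA2015 option), using Node's crypto:

```ts
import { createHash, createHmac } from 'crypto';

// URDNA2015 normally hashes serialized quads with SHA-256:
const sha256 = (data: string): string =>
  createHash('sha256').update(data, 'utf8').digest('hex');

// The proposal: use a keyed hash with an issuer-chosen key instead, so the
// resulting blank node label ordering depends on a secret the verifier never
// sees, rather than on the (guessable) quad contents alone.
const hmacSha256 = (key: Buffer, data: string): string =>
  createHmac('sha256', key).update(data, 'utf8').digest('hex');

// Hypothetical wiring: pass `(data) => hmacSha256(issuerKey, data)` wherever
// the canonicalizer would otherwise call `sha256(data)`.
```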
Doing it this way still leaks the number of blank nodes via the labels. Instead, replacing the output canonical blank node labels with their HMAC'd values minimizes data leakage. IMO, a simple sweep over the abstract output from rdf-canon to replace these labels prior to serialization to canonical N-Quads is the best approach.
I think that's reasonable.
Understood.
I think there are two use cases:

1. Selective disclosure with minimized data leakage (accepting that presentations are linkable).
2. Unlinkable presentations.
I'm strongly in favor of supporting the first case -- and it requires the hmac transformed version. So I'm +1 for at least defining that version. I don't feel strongly about the second case. If other people want the unlinkability version (for which I think the use cases are scant at this time and perhaps even flawed), we can define the untransformed version as well and tell people to minimize the number of blank nodes to prevent data leakage -- which is required anyway to limit linkability via the data revealed. But the utility of this, to me, seems far less than selective disclosure w/minimized data leakage.

We can define these sorts of things (hmac transformation, skolemized framing for filtered message matching, etc.) as generalized DI-SD primitives to be composed and built on top of, making implementations simpler and reusable. This sort of thing is being worked on as quickly as it can be as inputs to this work.
Ok, let's focus only on the hmac N-Quads transformer... Show me how it works. And I will see if I can get to being a +1 to adding it, but we will of course still need to convince the WG anyway. |
Here's some naive pseudo-code to replace the canonical blank node labels with their HMAC'd values:
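A minimal sketch of such a sweep, assuming the canonical N-Quads are available as a single string and using Node's crypto for the HMAC; the function and variable names are illustrative:

```ts
import { createHmac } from 'crypto';

// Replace every canonical blank node label (e.g. "_:c14n0") in the canonical
// N-Quads with an HMAC of that label, so the serialized labels no longer
// expose the canonical ordering directly.
// Naive: a real implementation should operate on the parsed quads rather than
// on the serialized string, to avoid touching literals that happen to contain
// the pattern "_:c14nN".
function hmacBlankNodeLabels(canonicalNQuads: string, hmacKey: Buffer): string {
  return canonicalNQuads.replace(/_:c14n[0-9]+/g, (label) => {
    const digest = createHmac('sha256', hmacKey).update(label, 'utf8').digest('hex');
    return `_:b${digest}`;
  });
}
```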
We're also working in the RCH WG to enable the rdf-canon algorithm to return a normatively referenceable abstract representation of the canonized dataset -- where the blank node labels could be replaced before serialization. This produces the same result but may be easier to write in normative spec text. Either way should work fine for implementations.
@dlongley Thank you! I think I understand the proposal now. This for me is a -1 for the following reasons:
In the end, I don't see a reason to HMAC the URDNA result rather than replacing sha256 with it. Having implementation experience with that solution, it is simpler and more robust.
P.S., choosing between SD and unlinkability IMO is wrong. If I can be tracked at all times, then SD has no meaning. There's a reason the crypto layer has gone to great lengths to guarantee both properties.
Yeah, it will reveal a lower threshold for the number of blank nodes, but I see the point. However, the number of signed messages (i.e., the size of the RDF dataset) will be revealed either way, so I don't see the significance?
As I stated in my linked comment, I don't think there is any reasonable, general purpose mechanism today that supports arbitrary VC modeling coupled with both unlinkability and minimizing data leakage. In fact, without getting involved in both the modeling and the crypto layer, I suspect they are diametrically opposed goals. I don't think you can avoid significant VC modeling design constraints in the unlinkability case without losing data leakage minimization. The layers are mixed here; you can't "bolt on" unlinkability. Therefore, we should take two different approaches for each case.

I believe that, for arbitrarily designed VCs, the best way to minimize data leakage with the tools we have today is what I proposed above. People that expect their VCs to be able to be selectively disclosed (but accept they are linkable) should use this mechanism. For unlinkability, if we even want to support it in this iteration, I think no hashing should be done at all and instead the number of blank nodes should be severely constrained or just fixed to some new, special existential-variable identifier scheme. After all, the goal is to avoid linking the data to anything anyway. Then, any VC that supports this MUST have a flat claim structure with no arrays.

Perhaps a good rule of thumb is: if there's enough data to be leaked, there's enough data to be linked. The fact is that too much complexity in the data being signed will either break unlinkability or break leakage minimization by trying to be unlinkable. So putting effort into removing linkability in the securing mechanism itself is of little value without also requiring a limit to the complexity of the data being signed. This implies VC designers must know this and bake it into their designs for the unlinkable case. I think that's just a fact with the technology we have now -- and may even end up being provably true generally. They don't need to do this if unlinkability is not a goal (in fact, the exact opposite, linkability, is often a goal for these cases).

Now, there are other efforts that may help with creating more complex, unlinkable VCs in the future by using "term-wise" signing, but those aren't ready on the timeline we need for this iteration of the BBS suite. None of the implementers here (that I'm aware of) has proposed using these techniques or tried to see if they suffer from similar problems. I think we need to move forward knowing that this iteration requires different approaches depending on the features we want.
It does not solve the data leak issue, as the number of blank nodes is leaked. When Alice reveals

These things can be completely avoided by bifurcating VC designs into two different classes of VCs:

1. VCs designed to be presented unlinkably.
2. VCs designed for arbitrary selective disclosure, accepting that presentations are linkable.
The first class can only be met by keeping the VC design extremely simple and minimal with unlinkability intentionally built in. Arbitrarily designed, interconnected webs of information will fail to fit into the first class, no matter what we do using today's cryptographic schemes. The VC designer MUST be involved here. The second class can be met by allowing arbitrary VC design but by sacrificing unlinkability. The use cases for VCs that fit into this class will almost always (if not always) require linkability via correlation in the revealed data anyway.
See the above for the reason. I don't think it's workable to have a single solution here.
It sounds like we have different opinions then :). I suspect it may be the case that "choosing between SD and unlinkability" could be formally proven to be a requirement depending on the complexity of the data. It's very similar to having to choose between a "perfectly hiding" or a "perfectly binding" commitment scheme; you can't have both. But, to work with the simpler argument here: I think SD has plenty of meaning when you have a VC that has:

- your full name
- your address (including zip code)
- your SSN
And you just want to reveal your full name and zip code. The fact that you are highly identifiable by your full name plus your zip code does not mean that you want to reveal your SSN to everyone that also knows those pieces of information. This is either a clear and logical rejection of your statement above -- or I've misunderstood it.
Revealing the size of the RDF dataset doesn't necessarily mean that the number of blank nodes is revealed. It's also much simpler to address that at the crypto layer alone vs. getting involved at the data modeling layer. If the blank node labels themselves reveal no information about the number of blank nodes, the crypto layer can more easily make adjustments to the number of messages without concern for how the data is modeled regarding what leaks might occur.
@dlongley thank you very much for your answer!
Just to try and clarify my position using that example! Imagine that at a later time I am required to reveal my SSN and zip code, but I don't want to reveal my name. The fact that I can be tracked at all times means that there is nothing stopping the Verifier(s) or any adversary (like some governments ;P) from deriving my name from my old presentations.
Just to make sure I understand. Your argument against using HMAC internally in URDNA is that it will leak information about the total number of blank nodes? Because, worst case scenario, this can leak some information about the length of lists with nested objects. So to this my points are:
I don't think it will be much easier to address these concerns in the crypto layer. If the issuer decides to include random messages while signing, then they will also need to communicate the

On the other hand, using HMAC internally in URDNA is very easily implemented with the current libraries and does not require developers to have any more knowledge about what happens internally. It also does not completely break unlinkability, so in the end I'm sorry, but I still don't understand why not to use it.
Once PR w3c/vc-di-ecdsa#19 is merged, this issue can be addressed by adding text that either uses the output of |
Please see the updated document for mechanism details that address these concerns. See PR #101 for privacy considerations related to both data leakage and unlinkability, including analysis.
To my understanding, the URDNA2015 canonicalization algorithm calculates the hashes of the reference sets of each blank node, sorts them alphabetically, and assigns labels in that order (if those hashes are unique). That order would be partially revealed to the Verifier (#10) and could leak information about the non-disclosed claims (even revealing the secret claims in some cases).
For example, if the holder has the following JSON-LD,
The reference sets would be,
And since h(Q1) > h(Q2), where "h" is a hash function, the canonicalization result would be:
If the Holder wants to only disclose their postal address, they will send to the Verifier something like,
From that, the Verifier can also conclude that h(Q1) > h(Q2), where h(Q2) is known (all elements of Q2 are disclosed). As a result, they can obtain information about the secret claims although they could not calculate h(Q1) directly, since Q1 contains the un-disclosed claims.
For example, if the Verifier knows that the claim the Holder tries to hide is their first name, they could just try every English name to find the ones for which h(Q1) > h(Q2). Doing so will reveal that the only name for which that inequality holds is the name “Kaiya”, revealing the Holder's secret.
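For concreteness, a simplified sketch of that guessing attack, standing in for URDNA2015's first-degree hash with a plain SHA-256 over the sorted quads of a reference set; the quad templates and the name list are placeholders, not the actual credential data:

```ts
import { createHash } from 'crypto';

const sha256 = (data: string): string =>
  createHash('sha256').update(data, 'utf8').digest('hex');

// Simplified stand-in for URDNA2015's first-degree hash of a blank node's
// reference set: sort the quads, concatenate, hash.
const referenceSetHash = (quads: string[]): string =>
  sha256([...quads].sort().join('\n'));

// Q2 is fully disclosed (the postal address), so the Verifier can hash it.
const q2 = [
  '_:a <http://example.org/postalCode> "12345" .',
  '_:a <http://example.org/streetAddress> "1 Example St" .',
];
const hQ2 = referenceSetHash(q2);

// Q1 contains the hidden first name; the Verifier knows its shape and can
// enumerate candidate names, keeping only those consistent with the observed
// label ordering, i.e. h(Q1) > h(Q2).
const q1WithName = (name: string): string[] => [
  `_:b <http://example.org/firstName> "${name}" .`,
  '_:b <http://example.org/address> _:a .',
];
const candidateNames = ['Kaiya', 'Alice', 'Bob']; // stand-in for ~19000 names
const consistent = candidateNames.filter(
  (name) => referenceSetHash(q1WithName(name)) > hQ2,
);
console.log(consistent); // names still consistent with the revealed ordering
```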
That example may be rare; it does, however, showcase how the order of the blank node labels leaks information. That information, even if it does not reveal the secret claims completely, could still potentially “break” the zero-knowledge property.