Blank node labels may leak information #60

Closed
BasileiosKal opened this issue Jul 27, 2021 · 27 comments
@BasileiosKal

To my understanding, the URDNA2015 canonicalization algorithm calculates the hashes of the reference sets of each blank node, sorts them alphabetically, and assigns labels in that order (if those hashes are unique). That order would be partially revealed to the Verifier (#10) and could leak information about the non-disclosed claims (even revealing the secret claims in some cases).

For example, if the holder has the following JSON-LD,

{"@context": ["http://schema.org/",
              "https://w3id.org/security/v2"],
 "firstName": "Kaiya",
 "address": {
    "postalAddress": "6096"}
}

The reference sets (following URDNA2015's first-degree hashing convention, where the blank node being hashed is written as _:a and any other blank node as _:z) would be:

Q1 = {<_:a, http://schema.org/firstName, "Kaiya">, <_:a, http://schema.org/address, _:z>}
Q2 = {<_:z, http://schema.org/address, _:a>, <_:a, http://schema.org/postalAddress, "6096">}

And since h(Q1) > h(Q2), where h is a hash function, the canonicalization result would be:

_:c14n0 <http://schema.org/postalAddress> "6096" .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/firstName> "Kaiya" .

If the Holder wants to only disclose their postal address, they will send to the Verifier something like,

{"@context": ["http://schema.org/",
              "https://w3id.org/security/v2"],
 "address": {
    "id": "urn:bnid:_:c14n0",
    "postalAddress": "6096"}
}

From that, the Verifier can also conclude that h(Q1) > h(Q2), where h(Q2) is known (all elements of Q2 are disclosed). As a result, they can obtain information about the secret claims although they could not calculate h(Q1) directly, since Q1 contains the un-disclosed claims.

For example, if the Verifier knows the claim that the Holder tries to hide is their first name, they could just try every English name to find the ones for which h(Q1) > h(Q2). Doing so will reveal that the only name for which that inequality holds is the name “Kaiya”, revealing the Holder's secret.

That example may be rare; it does, however, showcase how the order of the blank node labels leaks information. That information, even if it does not reveal the secret claims completely, could still potentially “break” the zero-knowledge property.
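
A minimal sketch of this guessing attack, assuming the jsonld.js library's canonize API (and ignoring document-loader setup); the helper name and candidate handling are illustrative only, not part of any spec:

// sketch: brute-force the hidden firstName by checking which candidates
// reproduce the blank node label ordering observed in the presentation
const jsonld = require('jsonld');

async function possibleFirstNames(candidates) {
  const matches = [];
  for (const name of candidates) {
    const doc = {
      "@context": ["http://schema.org/", "https://w3id.org/security/v2"],
      "firstName": name,
      "address": {"postalAddress": "6096"}
    };
    const nquads = await jsonld.canonize(doc, {
      algorithm: 'URDNA2015',
      format: 'application/n-quads'
    });
    // the presentation revealed the address node as _:c14n0, so only
    // candidates reproducing that label ordering remain possible
    if (nquads.includes('_:c14n0 <http://schema.org/postalAddress> "6096"')) {
      matches.push(name);
    }
  }
  return matches;
}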

@dlongley
Contributor

Doing so will reveal that the only name for which that inequality holds is the name “Kaiya”, revealing the Holder's secret.

That's a strong assumption. What if, instead, for example, there is an equal likelihood that any of (total names in the world) / 2 could have been used?

That information, even if it does not reveal the secret claims completely, could still potentially “break” the zero-knowledge property.

That may be true but may also not have any practical significance. Perhaps a better example would have used a boolean value.

There was another important assumption in the example, which was that there could not have been any other quads or blank nodes in the dataset. If there's actually a need to create a mitigation here, one avenue is the introduction of other (potentially randomized) quads.

@dlongley
Contributor

With respect to using LDP BBS+ with VCs (perhaps its most common -- or only? -- use), at least one other blank node (from the VC itself) would be involved. Some analysis should be performed to understand how this affects the (potential) issue.

@BasileiosKal
Author

That's a strong assumption. What if, instead, for example, there is an equal likelihood that any of (total names in the world) / 2 could have been used?

Yes, you are right. I do think that it depends on the example, though. For the example I presented, I have checked that “Kaiya” is indeed the only name (from a list of ~19000 English names) for which this inequality holds. For other examples, it could be anything (like total_names/2 or 20) depending on the credential and the disclosed claims.

There was another important assumption in the example, which was that there could not have been any other quads or blank nodes in the dataset.

If I understand your point correctly, it is true that other quads or blank nodes could affect the results. However, I don’t think that they necessarily prevent a leak (which also depends on the disclosed values). For example, the Holder could have the following credential (with multiple quads per blank node),

{
  "@context": [
    "http://schema.org/",
    "https://w3id.org/security/v2"],
  "type": ["VerifiableCredential"],
  "issuer": "did:example:489398593",
  "id": "https://issuer.com/credentials/123456",
  "credentialSubject": {
      "id": "did:example:654321",
      "givenName": "Jane",
      "familyName": "Doe",
      "email": "[email protected]",
      "kids": [
          {"givenName": "Kale",
           "age": "9"},
          {"givenName": "Ioanna",
           "age": "28"}]}
}

Here there are 2 blank nodes, each corresponding to one of the Holder's children (the credential and the credentialSubject have an ID, meaning they would not appear as blank nodes, if I'm not mistaken). If the Holder wants to disclose the name and age of only their oldest child (“Ioanna”), the same method could be used to leak information about the un-disclosed claims (the name and age of their youngest child, “Kale”). If the adversary knows that the Holder has two kids, they could reveal the hidden claims completely (at least according to my testing).

If there's actually a need to create a mitigation here, one avenue is the introduction of other (potentially randomized) quads.

If in the end a mitigation is needed, I do think randomized quads could work. Maybe they could be “added” to each blank node during issuance? Would the Holder receive them as part of the JSON-LD?

With respect to using LDP BBS+ with VCs (perhaps its most common -- or only? -- use), at least one other blank node (from the VC itself) would be involved. Some analysis should be performed to understand how this affects the (potential) issue.

In my initial example, one of the blank nodes is indeed the credential itself (which gets the label _:c14n1 in the end). I think that again it depends on the use case.

@dlongley
Contributor

dlongley commented Jul 28, 2021

Yes, you are right. I do think that it depends on the example, though. For the example I presented, I have checked that “Kaiya” is indeed the only name (from a list of ~19000 English names) for which this inequality holds. For other examples, it could be anything (like total_names/2 or 20) depending on the credential and the disclosed claims.

Right, and as mentioned above, I'd think there could be examples where the possible values were from a much smaller set (e.g., a boolean or a simple enum) where a hash reordering could negatively impact the privacy of more people.

Here there are 2 blank nodes, each corresponding to one of the Holder's children (the credential and the credentialSubject have an ID, meaning they would not appear as blank nodes, if I'm not mistaken).

My understanding is that for BBS+ credentials, both the credential itself and the credentialSubject would not have IDs, meaning they would be blank nodes. I think (unlinkable) selective disclosure (using this scheme) would be impossible if this were not true.

If in the end a mitigation is needed, I do think randomized quads could work. Maybe they could be “added” to each blank node during issuance? Would the Holder receive them as part of the JSON-LD?

If I'm correct about the credential itself using a blank node as an identifier, I'd think that a mitigation here may be the simple addition of a random value for some quad about the credential itself. Further analysis is required to determine if that would be sufficient. But, yes, this random value could just be added to the credential at issuance and it would appear in the JSON-LD VC the holder receives. The quad would be essentially meaningless (and never disclosed) when presenting.
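
A rough sketch of what this could look like at issuance; the property name is hypothetical and nothing here is claimed to match a spec:

// sketch: attach a random, never-disclosed value to the credential node so
// its reference-set hash (and therefore the label ordering) is randomized;
// "urn:example:randomness" is a made-up property name
const crypto = require('crypto');

function addRandomQuad(credential) {
  return {
    ...credential,
    "urn:example:randomness": crypto.randomBytes(32).toString('base64url')
  };
}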

@tplooker
Contributor

Sorry, when I originally read this issue I could not follow your example; I understand now that this is due to the blank node allocation algorithm. Further analysis should be performed on this scenario, as on first glance it appears to depend on quite a few conditions being true:

  1. No other blank nodes in the originally signed payload that are un-revealed.
  2. No other claims in the target blank node object that are un-revealed.
  3. Knowledge of the claim name that is un-revealed (e.g. knowing that firstName was in the originally signed credential).
  4. Knowledge of the enumerated set of values that the claim value could possibly be.

My understanding is that for BBS+ credentials, both the credential itself and the credentialSubject would not have IDs, meaning they would be blank nodes. I think (unlinkable) selective disclosure (using this scheme) would be impossible if this were not true.

That's correct.

If I'm correct about the credential itself using a blank node as an identifier, I'd think that a mitigation here may be the simple addition of a random value for some quad about the credential itself. Further analysis is required to determine if that would be sufficient. But, yes, this random value could just be added to the credential at issuance and it would appear in the JSON-LD VC the holder receives. The quad would be essentially meaningless (and never disclosed) when presenting.

This would appear to mitigate the risk of using the blank node allocation algorithm to deduce information about un-revealed claims.

@BasileiosKal
Author

  1. No other blank nodes in the originally signed payload that are un-revealed.
  2. No other claims in the target blank node object that are un-revealed.

@tplooker I may be terribly mistaken, but I do believe that even in cases where those 2 assumptions do not hold, there could be a potential problem. The second example I previously listed perhaps showcases this.

{
  "@context": [
    "http://schema.org/",
    "https://w3id.org/security/v2"],
  "type": ["VerifiableCredential"],
  "issuer": "did:example:489398593",
  "id": "https://issuer.com/credentials/123456",
  "credentialSubject": {
      "id": "did:example:654321",
      "givenName": "Jane",
      "familyName": "Doe",
      "email": "[email protected]",
      "kids": [
          {"givenName": "Kale",
           "age": "9"},
          {"givenName": "Ioanna",
           "age": "28"}]}
}

Here, even though the Holder keeps secret both the givenName and age of their first child (values in the same blank node), the Verifier could discover them following that same method (assuming the 3rd and 4th assumptions you listed are true). If my understanding is correct, even keeping secret other values (from other nodes) like their email and givenName would not change much, since those claims do not affect the reference set of the target blank node (which -at least in theory- essentially causes the leak).

  3. Knowledge of the claim name that is un-revealed (e.g. knowing that firstName was in the originally signed credential).
  4. Knowledge of the enumerated set of values that the claim value could possibly be.

Yes, that is very true. It could be just a matter of how often those assumptions hold. However, I do believe that this kind of information would not be hard to obtain, at least in some cases. This could be the result either of the credential's core being known or of the use of the claims indices (e.g. #29).

@BasileiosKal
Author

If I'm correct about the credential itself using a blank node as an identifier, I'd think that a mitigation here may be the simple addition of a random value for some quad about the credential itself. Further analysis is required to determine if that would be sufficient. But, yes, this random value could just be added to the credential at issuance and it would appear in the JSON-LD VC the holder receives. The quad would be essentially meaningless (and never disclosed) when presenting.

@dlongley thank you, I do believe I have a better understanding of your proposal now. The only occasion I can think of where additional “randomness” may be needed is in the -perhaps rare- cases where there are a lot of nested values forming a larger “blank node path”. There, the random quad added to the initial blank node (the one corresponding to the credential) would not affect the later ones. An example could look like the following,

{
  "@context": [
    "http://schema.org/",
    "https://w3id.org/security/v2"],
   "issuer": "did:example:12345678",
   "random_quad_1" : "1238nvfs925njasdfl4",
   "credentialSubject": {
        "key_1":{
            "key_2":[
                {"key_3": "A"},
                {"key_4": "B"}]
        }
 }
}

The reference set of the blank node corresponding to the nested object {"key_3": "A"} would not be affected by a random quad “linked” with the blank node of the credential itself (like random_quad_1 in the example above). However, if that does prove to be a problem, I think it could be solved by just adding some extra random quads where needed (maybe every second nested object?).

@tplooker
Contributor

tplooker commented Aug 5, 2021

@BasileiosKal @dlongley I've implemented the proposed solution as discussed above over here. Basically, during signing/verifying and deriving of a proof, the nonce from the proof block gets injected into the root of the document as a source of randomness, which affects the blank node allocation. However, the nonce is never revealed to the verifier when a derived proof is generated.
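
In rough pseudo-code (the helper and property names are mine, not the linked implementation's):

// sketch: mix the proof nonce into the document root before
// canonicalization; the injected property is never revealed to the
// verifier ("urn:example:nonce" is a hypothetical property name)
function withInjectedNonce(document, proofNonce) {
  return { ...document, "urn:example:nonce": proofNonce };
}

// e.g. canonize(withInjectedNonce(doc, proof.nonce)) during sign/verify/derive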

The reference set of the blank node corresponding to the nested object {"key_3": "A"} would not be affected by a random quad “linked” with the blank node of the credential itself (like random_quad_1 in the example above). However, if that does prove to be a problem, I think it could be solved by just adding some extra random quads where needed (maybe every second nested object?).

@dlongley perhaps you can comment more here? From a quick trial on the JSON-LD playground it did seem to affect the allocation.

See example 1 and example 2

@OR13
Contributor

OR13 commented Apr 25, 2023

I suggest this go in the privacy considerations section, and that we not add a pre-processing requirement, since it increases complexity.

@BasileiosKal
Author

This may make sense if rdf-canon ends up not allowing for the needed flexibility; however, IMO this is not suited for the considerations section.

Privacy/security considerations are for things that are easily avoidable, not required, or that an implementer should look out for so they will not unintentionally compromise security. This issue does not seem to fall into any of these categories.

It's fine to choose to not address the issue, but it should be very clear from the beginning of the document what properties the algorithm has. For example, claiming zk-selective-disclosure would be inappropriate.

@OR13
Contributor

OR13 commented Apr 28, 2023

It's fine to choose to not address the issue, but it should be very clear from the beginning of the document what properties the algorithm has. For example, claiming zk-selective-disclosure would be inappropriate.

Yes, I am suggesting that be stated in the security considerations.

To be clear, this is my proposal:

  1. no complex transformations to nquads, treat them as an array of opaque bytes.
  2. pass the array of bytes directly into bbs interfaces that have security considerations we can refer to.
  3. document (in this spec) any details related to revealing a specific nquad; this includes commenting on the following:

3.1 PII in the subject, predicate, object or named graph component of the quad.
3.2 Inferences drawn from URDNA2015 naming schema for blank nodes.
3.3 Any additional security or privacy considerations that apply to nquads

We should make it clear that using bbs on data that inherently leaks relationships has security and privacy implications, and that these are unavoidable when nquads are used directly.

After this is done, we should consider if there is a better solution if there is time remaining in the wg (which seems unlikely, given current contribution to the document).

@OR13
Contributor

OR13 commented Apr 28, 2023

Let's be honest about the default behavior and not hold up the work waiting for solutions that might never come.

@dlongley
Contributor

dlongley commented Apr 28, 2023

There are already solutions for the selective disclosure case that should be applied in the spec. The issuer should generate an HMAC key, and HMAC all c14nXXX blank node labels in the abstract output produced by rdf-canon to produce <base64url-hmac-digest> labels and then serialize to canonical N-Quads. The HMAC key should then be given to the holder so they can reproduce the same labels in selectively disclosed, rdf-canonized data.

This produces stronger privacy for selective disclosure than any other method that relies on structural information for disclosure (such as array index positions).

I don't think we need to wait on this -- it just needs to be implemented.

@dlongley
Contributor

I've commented on the unlinkability vs. data leakage trade off here if we would like to mine it for the privacy considerations section: w3c/rdf-canon#84 (comment)

@OR13
Contributor

OR13 commented Apr 28, 2023

There are already solutions for the selective disclosure case that should be applied in the spec. The issuer should generate an HMAC key, and HMAC all c14nXXX blank node labels in the abstract output produced by rdf-canon to produce <base64url-hmac-digest> labels and then serialize to canonical N-Quads.

I'm currently a -1 to doing this, but it is possible that if I could review an implementation or description of this, I would change my mind...

It seems like "wishful thinking", in a similar category to "linked secrets"; I prefer we cover the basics before discussing this.

This produces stronger privacy for selective disclosure than any other method that relies on structural information for disclosure (such as array index positions).

I don't think we need to wait on this -- it just needs to be implemented.

100% agree with this, but I am not planning to implement it, if I cannot read it in the current spec.

I question adding something to the spec that does not have multiple independent implementations... or at least 1 pretty strong one... where we believe there will quickly follow a second.

@dlongley are you saying we should only define the hmac transformed version in the current spec?

what is the timeline for documenting that approach, and is there an implementation of it that I can review?

@BasileiosKal
Author

For me it's a +1. The solution is as simple as using HMAC (or any keyed hash) in place of SHA-256 in URDNA2015. We can then describe the linkability problems in the privacy considerations.

I do agree that we should have a concrete proposal before resolving this, though. I would be happy to help document the solution together with the proper considerations. As far as timelines go, I can try finding some time next week.

@dlongley
Contributor

@BasileiosKal,

The solution is as simple as using HMAC (or any keyed hash) in place of SHA-256 in URDNA2015. We can then describe the linkability problems in the privacy considerations.

Doing it this way still leaks the number of blank nodes via the labels. Instead, replacing the output canonical blank node labels with their HMAC'd values minimizes data leakage. IMO, a simple sweep over the abstract output from rdf-canon to replace these labels prior to serialization to canonical N-Quads is the best approach.

@dlongley
Contributor

dlongley commented Apr 28, 2023

@OR13,

100% agree with this, but I am not planning to implement it, if I cannot read it in the current spec.

I think that's reasonable.

I question adding something to the spec that does not have multiple independent implementations... or at least 1 pretty strong one... where we believe there will quickly follow a second.

Understood.

@dlongley are you saying we should only define the hmac transformed version in the current spec?

I think there are two use cases:

  1. Selective disclosure with minimized data leakage.
  2. Unlinkability.

I'm strongly in favor of supporting the first case -- and it requires the hmac transformed version. So I'm +1 for at least defining that version. I don't feel strongly about the second case.

If other people want the unlinkability version (for which I think the use cases are scant at this time and perhaps even flawed), we can define the untransformed version as well and tell people to minimize the number of blank nodes to prevent data leakage -- which is required anyway to limit linkability via the data revealed. But the utility of this, to me, seems far less than selective disclosure w/minimized data leakage.

We can define these sorts of things (hmac transformation, skolemized framing for filtered message matching, etc.), as generalized DI-SD primitives to be composed and built on top of, making implementations simpler and reusable. This sort of thing is being worked on as quickly as it can be as inputs to this work.

@OR13
Contributor

OR13 commented Apr 29, 2023

Ok, let's focus only on the hmac N-Quads transformer... Show me how it works.

And I will see if I can get to being a +1 to adding it, but we will of course still need to convince the WG anyway.

@dlongley
Contributor

dlongley commented Apr 29, 2023

@OR13,

Here's some naive pseudo-code to replace c14nXXX labels with HMAC'd labels instead:

// pseudo-code: assumes an HMAC primitive, a canonize() that returns
// N-Quads, and a base64url encoder
hmac = createHmac('sha256', someKey);
nquads = canonize(document);
result = nquads
  .split('\n')
  // for each quad, replace every _:<label> with _:<base64url(HMAC(label))>
  .map(nq => nq.replace(/(_:([^\s]+))/g,
    (m, p1, p2) => '_:' + base64url.encode(hmac.sign(p2))) + '\n')
  // re-sort, since the new labels change the canonical ordering
  .sort();

The result is an array of sorted, canonical N-Quads w/reproducibly-randomized blank node labels. This array can be used as the set of "messages" (or the inputs for the "messages") to be passed to BBS.

We're also working in the RCH WG to enable the rdf-canon algorithm to return a normatively referenceable abstract representation of the canonized dataset -- where the blank node labels could be replaced there before serialization. This produces the same result but may be easier to write in normative spec text. Either way should work fine for implementations.
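
In sketch form, the abstract-representation variant could look like the following, where the quad/term shapes are illustrative rather than the RCH WG's normative representation:

// sketch: walk abstract quads and rewrite blank node labels via a label map
// before serializing to canonical N-Quads
function relabelQuads(quads, labelMap) {
  const relabel = term =>
    term.termType === 'BlankNode' && labelMap.has(term.value)
      ? { ...term, value: labelMap.get(term.value) }
      : term;
  return quads.map(q => ({
    subject: relabel(q.subject),
    predicate: q.predicate,
    object: relabel(q.object),
    graph: relabel(q.graph)
  }));
}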

@BasileiosKal
Author

@dlongley Thank you! I think I understand the proposal now.

This for me is a -1, for the following reasons:

  1. It completely breaks unlinkability even if 1 bn (blank node) is revealed. HMAC keys will need to be unique at least per user, making the "hashed bns" global correlation vectors.

  2. Replacing SHA-256 with HMAC when using URDNA solves the data leak issue, while retaining unlinkability in most cases. We can then add privacy considerations describing the degree of unlinkability as a function of the number of revealed bns.

In the end, I don't see a reason to HMAC the URDNA result rather than replacing SHA-256 with it. Having implementation experience with that solution, it is simpler and more robust.
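
As a conceptual sketch only -- no current URDNA2015 library is claimed to expose this hook -- the idea is to swap the internal digest for a keyed one:

// sketch: a keyed digest with the same bytes-in/hex-out shape as the
// SHA-256 used inside URDNA2015
const crypto = require('crypto');

function makeKeyedDigest(hmacKey) {
  return data => crypto.createHmac('sha256', hmacKey).update(data).digest('hex');
}

// hypothetical option, not a real rdf-canonize/jsonld.js API:
// canonize(dataset, { algorithm: 'URDNA2015', digest: makeKeyedDigest(key) })
// the output still uses plain _:c14n0.._:c14nN labels, but their order is keyed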

 

P.S., choosing between SD and unlinkability is, IMO, wrong. If I can be tracked at all times, then SD has no meaning. There's a reason the crypto layer has gone to great lengths to guarantee both properties.

@BasileiosKal
Author

Doing it this way still leaks the number of blank nodes via the labels.

Yea, it will reveal a lower bound on the number of bns, but I see the point. However, the number of signed messages (i.e., the size of the RDF dataset) will be revealed either way, so I don't see the significance?

@dlongley
Contributor

dlongley commented Apr 30, 2023

@BasileiosKal,

@dlongley Thank you! I think I understand the proposal now.

This for me is a -1, for the following reasons:

  1. It completely breaks unlinkability even if 1 bn (blank node) is revealed. HMAC keys will need to be unique at least per user, making the "hashed bns" global correlation vectors.

As I stated in my linked comment, I don't think there is any reasonable, general purpose mechanism today that supports arbitrary VC modeling coupled with both unlinkability and minimizing data leakage.

In fact, without getting involved in both the modeling and the crypto layer, I suspect they are diametrically opposed goals. I don't think you can avoid significant VC modeling design constraints in the unlinkability case without losing data leakage minimization. The layers are mixed here; you can't "bolt on" unlinkability.

Therefore, we should take two different approaches for each case.

I believe that, for arbitrarily designed VCs, the best way to minimize data leakage with the tools we have today is what I proposed above. People that expect their VCs to be able to be selectively disclosed (but accept they are linkable) should use this mechanism.

For unlinkability, if we even want to support it in this iteration, I think no hashing should be done at all and instead the number of blank nodes should be severely constrained or just fixed to some new, special existential-variable identifier scheme. After all, the goal is to avoid linking the data to anything anyway. Then, any VC that supports this MUST have a flat claim structure with no arrays. Perhaps a good rule of thumb is: If there's enough data to be leaked, there's enough data to be linked.

The fact is that too much complexity in the data that is being signed itself will either break unlinkability or break leakage minimization by trying to be unlinkable. So putting effort into removing linkability in the securing mechanism itself is of little value without also requiring a limit to the complexity in the data being signed. This implies VC designers must know this and bake it into their designs for the unlinkable case. I think that's just a fact with the technology we have now -- and may even end up being provably true generally. They don't need to do this if unlinkability is not a goal (in fact, the exact opposite, linkability, is often a goal for these cases).

Now, there are other efforts that may help with creating more complex, unlinkable VCs in the future by using "term-wise" signing, but those aren't ready on the timeline we need for this iteration of the BBS suite. None of the implementers here (that I'm aware of) has proposed using these techniques or tried to see if they suffer from similar problems.

I think we need to move forward knowing that this iteration requires different approaches depending on the features we want.

  2. Replacing SHA-256 with HMAC when using URDNA solves the data leak issue, while retaining unlinkability in most cases. We can then add privacy considerations describing the degree of unlinkability as a function of the number of revealed bns.

It does not solve the data leak issue, as the number of blank nodes is leaked. When Alice reveals _:c14n20, the verifier knows something about her that she didn't intend, for example, that she has three children. Additional mitigations would be required here that introduce more complexity in VC designs. One example of such a mitigation is adding "dummy" blank nodes to the data to cause the same number of blank nodes to always be signed. However, we shouldn't presume that that will entirely solve the problem or that it's even easy to do properly. I expect it to require coordination at the data modeling layer which, I think, is quite undesirable (and impractical).

These things can be completely avoided by bifurcating VC designs into two different classes of VCs:

  1. VCs that always support unlinkability and selective disclosure with minimized data leakage.
  2. VCs that only support selective disclosure with minimized data leakage.

The first class can only be met by keeping the VC design extremely simple and minimal with unlinkability intentionally built in. Arbitrarily designed, interconnected webs of information will fail to fit into the first class, no matter what we do using today's cryptographic schemes. The VC designer MUST be involved here.

The second class can be met by allowing arbitrary VC design but by sacrificing unlinkability. The use cases for VCs that fit into this class will almost always (if not always) require linkability via correlation in the revealed data anyway.

In the end, I don't see a reason to HMAC the URDNA result rather than replacing SHA-256 with it. Having implementation experience with that solution, it is simpler and more robust.

See the above for the reason. I don't think it's workable to have a single solution here.

P.S., choosing between SD and unlinkability is, IMO, wrong. If I can be tracked at all times, then SD has no meaning. There's a reason the crypto layer has gone to great lengths to guarantee both properties.

It sounds like we have different opinions then :).

I suspect it may be the case that "choosing between SD and unlinkability" could be formally proven to be a requirement depending on the complexity of the data. It's very similar to having to choose between a "perfectly hiding" or a "perfectly binding" commitment scheme; you can't have both.

But, to work with the simpler argument here: I think SD has plenty of meaning when you have a VC that has:

  1. SSN
  2. Your full name
  3. Your zip code

And you just want to reveal your full name and zip code. The fact that you are highly identifiable by your full name plus your zip code does not mean that you want to reveal your SSN to everyone who also knows those pieces of information. This is either a clear and logical rejection of your statement above -- or I've misunderstood it.

Yea, it will reveal a lower bound on the number of bns, but I see the point. However, the number of signed messages (i.e., the size of the RDF dataset) will be revealed either way, so I don't see the significance?

Revealing the size of the RDF dataset doesn't necessarily mean that the number of blank nodes is revealed. It's also much simpler to address that at the crypto layer alone vs. getting involved at the data modeling layer. If the blank node labels themselves reveal no information about the number of blank nodes, the crypto layer can more easily make adjustments to the number of messages without concern for how the data is modeled regarding what leaks might occur.

@BasileiosKal
Author

@dlongley thank you very much for your answer!

It sounds like we have different opinions then :).
But, to work with the simpler argument here: I think SD has plenty of meaning when you have a VC that has:

Just to try and clarify my position using that example! Imagine at a later time I am required to reveal my SSN and zip code, but I don't want to reveal my name. The fact that I can be tracked at all times means that there is nothing stopping the Verifier(s) or any adversary (like some governments ;P) from deriving my name from my old presentations.

It does not solve the data leak issue, as the number of blank nodes is leaked. When Alice reveals _:c14n20, the verifier knows something about her that she didn't intend, for example, that she has three children. Additional mitigations would be required here that introduce more complexity in VC designs.

Just to make sure I understand: your argument against using HMAC internally in URDNA is that it will leak information about the total number of bns?

Because, worst case scenario, this can leak some information about the length of lists with nested objects. So to this my points are:

  1. This seems trivially addressed. The recommendation would be as simple as “add a random number of elements to all lists with nested objects” or “make all lists with nested objects the same length by adding random elements”, where element = {randKey: randValue} (see the sketch after this list). Or we could simply warn about the possible repercussions of using such lists.
  2. This does not seem to require any coordination on the data modeling layer and can even happen internally as part of a ciphersuite operation.
  3. Most importantly, to me, even if we could not address that concern (which, if I'm not mistaken, we can), this does not seem nearly as bad as completely breaking unlinkability.
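
A minimal sketch of the padding idea from point 1, with made-up helper and key names and a fixed target length:

// sketch: pad a list of nested objects up to targetLength with random
// {randKey: randValue} elements
const crypto = require('crypto');

function padList(list, targetLength) {
  const padded = [...list];
  while (padded.length < targetLength) {
    padded.push({
      ['urn:example:pad-' + crypto.randomBytes(4).toString('hex')]:
        crypto.randomBytes(8).toString('base64url')
    });
  }
  return padded;
}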

It's also much simpler to address that at the crypto layer alone vs. getting involved at the data modeling layer

I don’t think it will be much easier to address these concerns in the crypto layer. If the issuer decides to include random messages while signing, then they will also need to communicate the {index: randMessage} map to the user. This is not something easily implemented with the current libraries AFAIK. You will either need changes in the data modeling layer to include that map in the VC somehow, or directly use the crypto layer libs (not ideal IMO).

On the other hand, using HMAC internally in URDNA is very easily implemented with the current libraries and does not require any additional knowledge from developers about what happens internally.

It also does not completely break unlinkability, so in the end I'm sorry, but I still don't understand why not to use it.

@dlongley
Contributor

Once PR w3c/vc-di-ecdsa#19 is merged, this issue can be addressed by adding text that either uses the output of hmacIdCanonize directly or that calls it internally and then sorts by HMAC output and produces a label map that assigns numbered bnode labels according to that sort order.

@dlongley
Contributor

dlongley commented Nov 28, 2023

PRs #98 / #99 (some rebasing adjustment was needed) have addressed this issue. Now bnode labels are HMAC-randomized, sorted, and then relabeled with a simple counter to maximize group-hiding. This can be closed as addressed.
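
As I read that description, the relabeling amounts to something like the following sketch (the helper name and the 'b' label prefix are mine, not the spec's):

// sketch: HMAC each canonical label, sort by the HMAC values, then assign
// fresh counter-based labels so only the keyed order -- not the HMAC
// outputs themselves -- is exposed
const crypto = require('crypto');

function shuffledLabelMap(c14nLabels, hmacKey) {
  const hmacOf = label =>
    crypto.createHmac('sha256', hmacKey).update(label).digest('base64url');
  return new Map(
    [...c14nLabels]
      .map(label => [label, hmacOf(label)])
      .sort((a, b) => (a[1] < b[1] ? -1 : 1))
      .map(([label], i) => [label, 'b' + i])
  );
}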

@Wind4Greg
Collaborator

Please see the updated document for mechanism details that address these concerns. See PR #101 for privacy considerations related to both data leakage and unlinkability, including analysis.

@Wind4Greg added the pending-close label (the issue will be closed in 7 days if there are no objections) and removed the ready-for-pr label on Dec 11, 2023