Inefficient payload encoding #289

Open
AdamZWu opened this issue Oct 5, 2023 · 10 comments
Labels
enhancement New feature or request

Comments

@AdamZWu
Contributor

AdamZWu commented Oct 5, 2023

The in-toto specs use DSSE to contain the statement data and carry signatures.

  • The statement is first serialized as a JSON string;
  • Then base64 encoded and stored in the "payload" field of DSSE;
  • Then the whole envelope is serialized as a JSON string.
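
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the reference implementation: `wrap_statement` and `EXAMPLE_STATEMENT` are hypothetical names, the statement contents are made up, and signing is left out entirely.

```python
import base64
import json

PAYLOAD_TYPE = "application/vnd.in-toto+json"

# Illustrative statement only; the subject and predicate values are made up.
EXAMPLE_STATEMENT = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "example.tar.gz", "digest": {"sha256": "ab" * 32}}],
    "predicateType": "https://example.com/predicate/v1",
    "predicate": {"builder": "example"},
}


def wrap_statement(statement: dict) -> str:
    """Wrap a statement in an unsigned DSSE-style envelope (hypothetical helper)."""
    # Step 1: serialize the statement as a JSON string.
    body = json.dumps(statement, separators=(",", ":")).encode("utf-8")
    # Step 2: base64-encode it and store it in the "payload" field.
    envelope = {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(body).decode("ascii"),
        "signatures": [],  # signing is omitted in this sketch
    }
    # Step 3: serialize the whole envelope as a JSON string.
    return json.dumps(envelope)


envelope_json = wrap_statement(EXAMPLE_STATEMENT)
```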

The base64 encoding step adds roughly 33% size overhead with no real benefit, because the JSON-serialized statement is already a valid string. This overhead makes adoption harder for resource-constrained CI/CD systems and is generally wasteful: some complex artifacts may generate 1 GB of provenance data, and 33% of that is about 330 MB of extra data to store and transfer for every build.
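
For reference, the overhead figure follows from base64 emitting 4 output characters for every 3 input bytes. A small sketch of the arithmetic (the helper name `base64_size` and the 3 MB input size are just illustrative choices):

```python
import base64


def base64_size(n: int) -> int:
    """Length of the padded base64 encoding of n input bytes."""
    return 4 * ((n + 2) // 3)


raw = 3_000_000                      # hypothetical serialized statement size
encoded = base64_size(raw)           # size after base64 encoding
overhead = (encoded - raw) / raw     # fraction of extra bytes, about one third

# Sanity check against the standard library on a small input.
assert base64_size(3000) == len(base64.b64encode(bytes(3000)))
```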

Could we have some more efficient solutions?

For example, the DSSE spec could be updated so that not every payload must be base64 encoded: the MIME type "application/vnd.in-toto+json" would indicate that the payload is text and can be consumed directly.

Alternatively, if we must use base64 for DSSE payload, could we introduce something like "application/vnd.in-toto+json+lzma" which compresses the serialized statement first?

@TomHennen
Contributor

I seem to recall that base64 was chosen regardless of the string type, partly because it helps avoid deserialization attacks and probably also to avoid having to escape stuff like this.
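
To illustrate the escaping point (one reading of it, not a statement of the DSSE authors' rationale): embedding a serialized statement directly as a JSON string value forces every quote and backslash to be escaped, so the bytes on the wire differ from the bytes that were signed and must be unescaped exactly. The payload contents below are made up.

```python
import base64
import json

# A serialized statement containing quotes and backslashes (made-up content).
statement_json = json.dumps(
    {"predicate": {"note": 'say "hi"', "path": "C:\\build"}}
)

# Option A: embed the serialized statement directly as a JSON string value.
# Every quote and backslash in it gets escaped.
embedded = json.dumps({"payload": statement_json})

# Option B: base64-encode first; the result needs no escaping at all.
encoded = json.dumps(
    {"payload": base64.b64encode(statement_json.encode("utf-8")).decode("ascii")}
)

# Both forms round-trip to the original serialized statement.
assert json.loads(embedded)["payload"] == statement_json
assert base64.b64decode(json.loads(encoded)["payload"]).decode("utf-8") == statement_json
```

Note that the escaped form is itself larger than the raw string, so skipping base64 does not bring the overhead to zero for JSON payloads.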

However, I do wonder if it would help to specify an encoding that is base64(compress(SERIALIZED_BODY)).
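
A rough sketch of what base64(compress(SERIALIZED_BODY)) could look like, assuming LZMA as the compressor. The media type is the one floated earlier in this thread and is not registered anywhere; `encode_payload`, `decode_payload`, and `sample_body` are hypothetical helpers, and the sample data is synthetic.

```python
import base64
import hashlib
import json
import lzma

# Hypothetical media type from the proposal in this thread.
COMPRESSED_TYPE = "application/vnd.in-toto+json+lzma"


def encode_payload(body: bytes) -> str:
    """base64(compress(SERIALIZED_BODY))"""
    return base64.b64encode(lzma.compress(body)).decode("ascii")


def decode_payload(payload: str) -> bytes:
    """Inverse of encode_payload."""
    return lzma.decompress(base64.b64decode(payload))


def sample_body(n_subjects: int = 200) -> bytes:
    """Synthetic serialized statement with many subjects (made-up data)."""
    subjects = [
        {
            "name": f"artifact-{i}.bin",
            "digest": {"sha256": hashlib.sha256(str(i).encode()).hexdigest()},
        }
        for i in range(n_subjects)
    ]
    return json.dumps({"subject": subjects}).encode("utf-8")


body = sample_body()
plain_size = len(base64.b64encode(body))   # current encoding
packed_size = len(encode_payload(body))    # compressed, then base64
```

The gain depends heavily on the payload: hex digests only compress to about half their size, and very small statements may even grow because of the compressor's fixed header.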

I seem to remember we may have done some experimentation on this internally?

@adityasaky
Member

adityasaky commented Oct 5, 2023

Side note: does this belong in https://github.com/secure-systems-lab/dsse, apart from any changes to in-toto's media type as a consequence of a DSSE change?

@AdamZWu
Contributor Author

AdamZWu commented Oct 5, 2023

@adityasaky: if we are changing the base64 encoding, then it is more of a DSSE task; if we are not changing that but are compressing serialized statements, then I think it is an in-toto task.

@TomHennen
Contributor

I suspect

  1. We'd want https://github.com/secure-systems-lab/dsse to make some recommendations on what/how to do this
  2. https://github.com/in-toto/attestation/blob/main/spec/v1/envelope.md#fields would need to be updated either way?

@TomHennen
Contributor

So generally folks are open to some solution here. We'd probably be looking for a PR that defines whatever the proposal is along with some code that actually does it.

@TomHennen added the enhancement label and removed the triage label on Oct 13, 2023

@MarkLodato
Contributor

This issue should be closed in favor of secure-systems-lab/dsse#63. All changes need to happen there, since this is a DSSE issue.

@AdamZWu
Contributor Author

AdamZWu commented Oct 17, 2023

Also mentioned in secure-systems-lab/dsse#63:

As another alternative, the bundle format selected by in-toto, JSON lines, also offers a compression mode.

If we were to compress the bundle, does that make it again an in-toto issue? :P


Maybe compressing the bundle would yield a better data reduction, as a bundle will likely contain multiple attestations, and if these attestations are for the same set of artifacts, the "subject" fields will be repeated multiple times. A bundle-level compression would be able to discover the redundancy, something a statement-level compression cannot achieve.
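
A quick way to check this intuition, sketched under simplifying assumptions: the lines here are raw statement JSON with synthetic data, gzip stands in for whatever compression the bundle would use, and `make_statements` is a hypothetical helper. In a real bundle each line would be a DSSE envelope with a base64 payload, which may hide part of the cross-statement redundancy from the compressor.

```python
import gzip
import hashlib
import json


def make_statements(n_statements: int = 5, n_subjects: int = 20) -> list:
    """Statements that share one subject list but differ in predicate (made-up data)."""
    subjects = [
        {
            "name": f"artifact-{i}.bin",
            "digest": {"sha256": hashlib.sha256(str(i).encode()).hexdigest()},
        }
        for i in range(n_subjects)
    ]
    return [
        {
            "subject": subjects,
            "predicateType": f"https://example.com/predicate/{k}",
            "predicate": {"index": k},
        }
        for k in range(n_statements)
    ]


lines = [json.dumps(s).encode("utf-8") for s in make_statements()]

# Statement-level: compress each line independently and add up the sizes.
per_statement_total = sum(len(gzip.compress(line)) for line in lines)

# Bundle-level: compress the JSON Lines file as a whole.
bundle_total = len(gzip.compress(b"\n".join(lines) + b"\n"))
```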

@deeglaze

The CoRIM draft proposes a COSE_Sign1 envelope around a CBOR-serialized object for compact representation, avoiding JSON altogether. It seems really similar to in-toto, except that it has some notion of predefined predicates and limited extensibility... I'm still trying to suss out where we can remove redundancy across the two efforts.

@marcelamelara
Contributor

Related #361

@TomHennen
Contributor

Discussed at today's attestation maintainers meeting.

We're open to both of these things. Our main concern is interoperability: having multiple ways to encode and represent attestations could significantly hinder adoption. One way to resolve this might be a 'generic' converter that converts newer encodings to a canonical JSON encoding as needed.
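
One possible shape for such a converter, as a sketch only: it dispatches on `payloadType` and returns the statement as plain JSON data. The `+lzma` media type is the hypothetical one from earlier in this thread, and `DECODERS` and `to_canonical_statement` are made-up names. It deliberately returns the statement rather than a rewritten envelope, since as far as I understand DSSE the signature covers both the payload type and the payload bytes, so verification would have to happen against the original envelope.

```python
import base64
import json
import lzma

# Maps each supported payloadType to a function that turns the decoded
# payload bytes into serialized JSON bytes. The "+lzma" type is hypothetical.
DECODERS = {
    "application/vnd.in-toto+json": lambda raw: raw,
    "application/vnd.in-toto+json+lzma": lzma.decompress,
}


def to_canonical_statement(envelope: dict) -> dict:
    """Return the statement carried by an envelope, whatever its encoding."""
    payload_type = envelope["payloadType"]
    try:
        decode = DECODERS[payload_type]
    except KeyError:
        raise ValueError(f"unsupported payloadType: {payload_type}") from None
    raw = base64.b64decode(envelope["payload"])
    return json.loads(decode(raw))
```

Consumers that only understand the JSON form could call this at the edge and stay unaware of newer encodings.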

We'd be happy to review PRs if folks who are highly motivated here want to submit them.

7 participants
@MarkLodato @deeglaze @TomHennen @adityasaky @AdamZWu @marcelamelara and others