Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote_exec.proto: support ZSTD with dictionary #276

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

sluongng
Copy link
Contributor

Closes #272

@sluongng
Copy link
Contributor Author

cc: @mostynb @werkt @EdSchouten

@sluongng
Copy link
Contributor Author

This PR is/will be a bit stale because I am a bit occupied with other tasks for the time being.

I could revisit this if there are more demands for such capability, or if somebody else wants to pick up the flag, please feel free to.

@sluongng
Copy link
Contributor Author

Since we might have some use case for this via bazelbuild/bazel#18997, I have pushed a new version of this PR according to review feedbacks.

As for @EdSchouten suggestion about enhancing GetCapabilitiesRequest for better negotiation. I think that should be done in a separate PR from this one? Wdyt?

@EdSchouten
Copy link
Collaborator

EdSchouten commented Jan 22, 2024

As for @EdSchouten suggestion about enhancing GetCapabilitiesRequest for better negotiation. I think that should be done in a separate PR from this one? Wdyt?

Why can't we just do it right as part of this PR? What's preventing us from doing that?

If we don't add full negotiation to GetCapabilities itself, what are your thoughts on adding a dedicated RPC method to ContentAddressableStorage? For example, a GetCompressionDictionaries().

service ContentAddressableStorage {
  rpc GetCompressionDictionaries(GetCompressionDictionariesRequest)
    returns (GetCompressionDictionariesResponse);
}

message GetCompressionDictionariesRequest {
  string instance_name = 1;
  DigestFunction.Value digest_function = 2;
  Compressor.Value compressor = 3;
}

message GetCompressionDictionariesResponse {
  message Dictionary {
    uint32 id = 1;
    Digest digest = 2;
  }

  // The dictionary ID that clients SHOULD use when uploading
  // objects into the Content Addressable Storage. An entry for
  // this dictionary ID MUST be present in `dictionaries`.
  uint32 default_dictionary_id = 1;

  // Dictionaries that may be used by the server when returning
  // objects stored in the Content Addressable Storage. The key
  // corresponds to a dictionary ID, as described in RFC 8878,
  // section 5, while the value refers to a dictionary
  // as described in RFC 8878, chapter 6.
  repeated Dictionary dictionaries = 2;
}

@sluongng
Copy link
Contributor Author

Why can't we just do it right as part of this PR? What's preventing us from doing that?

Initially, I thought it could make it harder to get this merge. But I looked at it and found it to be less disruptive than I initially thought. So I pushed a commit that does just that, the language could use some help though 🙏 .

what are your thoughts on adding a dedicated RPC method to ContentAddressableStorage? For example, a GetCompressionDictionaries().

I think it would still require us to pass a minimum boolean value in ServerCapabilities to indicate that the server does support that RPC.
And then, the additional RPC simply returns what could be a simple CAS entry... so a glorified ByteStream.Read call.

So instead of sending the boolean value in ServerCapabilties, let's just send the digest and use the existing ByteStream.Read RPC instead.

The caveat is ofc, the server would have to store that CAS value somewhere, which could potentially limit how fast the server could swap out the dictionaries. Realistically, I don't expect the dictionary list to change too frequently for most use cases as it will change the delta calculation, leading to more frequent cache invalidation.

Lastly, I think we could reserve the dedicated RPC for when the dictionary negotiation gets more complicated. But for now, this would do.

@EdSchouten
Copy link
Collaborator

No, it doesn't. The presence of ZSTD_DICT can be used by the client to determine whether that additional RPC is supported.

@sluongng
Copy link
Contributor Author

Discussed with @EdSchouten offline. As some server/client/proxy are caching the ServerCapabilities based on the GetCapabilitiesRequest, adding more fields to GetCapabilitiesRequest could further complicate the existing caching mechanism.

Sent a change to move the Dictionary fetching to a dedicated RPC in hope that this change has minimal impact on existing implementations. I still want to make this rpc ZSTD specific, but if there is a strong opinion on making it generic for future compressors, suggestions are welcome.

@bduffany
Copy link

I still want to make this rpc ZSTD specific, but if there is a strong opinion on making it generic for future compressors, suggestions are welcome.

Seems to me that compression dictionaries are simple enough where they wouldn't vary much across compressor types, so if it were compressor-specific then there would eventually wind up being a bunch of different RPCs with the same Request/Response signature? (I don't feel super strongly either way, just something to consider)

@sluongng
Copy link
Contributor Author

I still want to make this rpc ZSTD specific, but if there is a strong opinion on making it generic for future compressors, suggestions are welcome.

Seems to me that compression dictionaries are simple enough where they wouldn't vary much across compressor types, so if it were compressor-specific then there would eventually wind up being a bunch of different RPCs with the same Request/Response signature? (I don't feel super strongly either way, just something to consider)

I see your point. The difficult part, for me, is not being able to plan a head for what the other compressors' dictionary support is going to look like. Without investing too much time into Brotli compression, I could only make guesses.

For example, I tried to look into Brotli shared dictionary and I don't think there is a dictionary ID attached. Quickly reading through Chrome Documentation and Cloudflare Blog, I think the dictionary is referred to by a combination of the media type that it's used for, and it's CAS SHA256 ID.

If that's indeed the case, then I think we could make the RPC generic with something like @EdSchouten suggested, with slight modification.

service ContentAddressableStorage {
  rpc GetCompressionDictionaries(GetCompressionDictionariesRequest)
    returns (GetCompressionDictionariesResponse);
}

message GetCompressionDictionariesRequest {
  string instance_name = 1;
  DigestFunction.Value digest_function = 2;
  Compressor.Value compressor = 3;
}

message GetCompressionDictionariesResponse {
  message ZstdDictionary {
    uint32 id = 1;
    Digest digest = 2;
  }

  // The ZSTD dictionary ID that clients SHOULD use when uploading
  // objects into the Content Addressable Storage. An entry for
  // this dictionary ID MUST be present in `zstd_dictionaries`.
  uint32 default_zstd_dictionary_id = 1;

  // ZSTD Dictionaries that may be used by the server when returning
  // objects stored in the Content Addressable Storage. The key
  // corresponds to a dictionary ID, as described in RFC 8878,
  // section 5, while the value refers to a dictionary
  // as described in RFC 8878, chapter 6.
  repeated ZstdDictionary zstd_dictionaries = 2;
}

That way GetCompressionDictionariesResponse could be extended in the future to support other compressors (i.e. Brotli)? WDYT?

@sluongng
Copy link
Contributor Author

sluongng commented Jan 31, 2024

A second related problem to this PR I have been thinking about is how to enable dictionary selection on client-initiated uploads.

For example: let's say I want to do a bazel build //some/service/a:binary target and upload the result binary to the remote cache server. Is there a way for the client to compress the binary with a dictionary specific to //some/service/a:binary action or //some/service/a package?

Initially, I wanted to modify the dictionary message to include metadata like this

 message ZstdDictionary {
     uint32 id = 1;
     Digest digest = 2;
+    repeated string prefer_for = 3;
 }

where prefer_for could be a set of build actions mnemonics and/or build targets.

But now I am thinking that we should make a standalone map for this

 message GetCompressionDictionariesResponse {
   message ZstdDictionary {
     uint32 id = 1;
     Digest digest = 2;
   }

   uint32 default_zstd_dictionary_id = 1;

   repeated ZstdDictionary zstd_dictionaries = 2;

+  message DictionaryHint {
+    string hint = 1;
+    Digest dictionary_digest = 2;
+  }
+  repeated DictionaryHint dictionary_hints = 3;
 }

for faster lookup 🤔.


message GetZstdDictionariesResponse {
message Dictionary {
uint32 id = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we could make dictionaries reasonably generic by replacing this uint32 with an Any. That wouldn't account for session-specific dictionaries, but it should allow for ~any static dictionary. The unpacking of the Any, and understanding what blobs it applies to, is very much compressor-specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an "id" specific field, I would say we should use string instead of Any at the minimum. Any delegates the responsibility down to implementations to validate the data format which could be a pain to work with.

If generic is a requirement here, I think something like this could be a good alternative

message GetDictionariesResponse {
  message DefaultDictionaryId {
    oneof id {
      uint32 zstd_dictionary_id = 1;
    }
  }
  // The dictionary ID that clients SHOULD use when uploading
  // objects into the Content Addressable Storage. An entry for
  // this dictionary ID MUST be present in `dictionaries`.
  DefaultDictionaryId default_dictionary_id = 1;

  message ZstdDictionary {
    uint32 id = 1;
    Digest digest = 2;
  }
  message ZstdDictionaries {
    repeated ZstdDictionaries zstd_dictionaries = 1;
  }
  message Dictionaries {
    oneof dictionaries {
      ZstdDictionaries zstd_dictionaries = 1;
    }
  }

  // Dictionaries that may be used by the server when returning
  // objects stored in the Content Addressable Storage. The key
  // corresponds to a dictionary ID, as described in RFC 8878,
  // section 5, while the value refers to a dictionary
  // as described in RFC 8878, chapter 6.
  Dictionaries dictionaries = 2;
}

This would enable future Dictionary types (if there were any) could be added incrementally. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be fine with a string, but given the conversation yesterday regarding people wanting more structured types to avoid the need to shove json into string fields, thought that Any might be appropriate. Using a string for an ID field would be more consistent with the rest of the API. In any case, I don't think that using string vs Any absolves the client from the need to validate the input. In this case of zstd, for example, the client will need to parse the string into an int ID to match with the ID encoded in the compressed blob.

At some level what I'm trying to avoid is needing to modify this API with each dictionary-compression function that we add, which means trying to remove anything zstd-specific. It's not realistic to design an API that will work in all cases without investing significantly in understanding, but there are only so many realistic ways to get dictionary-based compression to work, and I have to believe that "encode a dictionary ID in the compressed blob" is reasonably common.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an "id" specific field, I would say we should use string instead of Any at the minimum. Any delegates the responsibility down to implementations to validate the data format which could be a pain to work with.

But doesn't a string also require that an implementation validates the data format? There are also strings that are not valid integers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support compression with external dictionary
5 participants