Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting protobuf preserving unknown fields #2655

Open
xiaozhikang0916 opened this issue Apr 26, 2024 · 9 comments · May be fixed by #2860
Open

Supporting protobuf preserving unknown fields #2655

xiaozhikang0916 opened this issue Apr 26, 2024 · 9 comments · May be fixed by #2860
Labels

Comments

@xiaozhikang0916
Copy link
Contributor

I would like to support preserving unknown fields for ProtoBuf, here is an issue for discussion beforehand.

Motivation

Basically it is required in protobuf spec, so it needs to be implemented to make kotlinx.serialization a valid protobuf parser.

Basic design

A new helper class will be introduced

interface UnknownFieldsHolder: Map<ProtoId, ProtoField>

Users can add to class property according to their needs:

@Serializable
data class ForwardCompatibleData(
    @ProtoNumber(1) val name: String,
    // No @ProtoNumber needed
    // Maybe introducing a new annotation?
    val unknownFields: UnknownFieldsHolder,
)

Behaviours

  • The ProtoBuf format should keeps all undeclare fields in the UnknownFieldsHolder when deserializing;
  • All data in the UnknownFieldsHolder should be written back to wire as is during serializing;
  • If new fields are added in oneof field, the new value will be stored in UnknownFieldsHolder, leaving the @ProtoOneOf property to be null;
  • Other formats should not deseralize or serialize this property.

Further more

Enumeration

Enums in protobuf can also have unknown values in old version of code, but it is much more easier to store it in sealed interface / objects styles:

sealed interface IPhoneType {
  val value: Int
}
data object WorkPhone: IPhoneType {
  override val value: Int
    get() = 1
}
data object HomePhone: IPhoneType {
  override val value: Int
    get() = 2
}
data class UnknownPhoneType(override val value: Int): IPhoneType

Other formats

Do other formats need similar functionality?

@pdvrieze
Copy link
Contributor

@xiaozhikang0916 XML has this option. Both for tags and attributes (but they need to be stored separately). What you need is:

  1. A way of recording arbitrary Protobuf values (equivalent to JsonElement or xml Node)
  2. A container for unknown elements (it makes a lot of sense to require this to be explicitly annotated)

@xiaozhikang0916
Copy link
Contributor Author

A way of recording arbitrary Protobuf values

It is required for this feature, but I will keep it an internal interface, because I believe using protobuf in active code should be restricted in the scope of serializable class, and map-like structure ( with integer key of proto id ) is not dev-friendly.

@pdvrieze
Copy link
Contributor

@xiaozhikang0916 You probably want some way to specify which numbers could be omitted (deprecated fields), or whether to write these values at all (changes in known attributes may change the validity of the opaque unsupported attributes).

@xiaozhikang0916
Copy link
Contributor Author

@pdvrieze I got your point, but IMO an option to omit, or ignore any fields ( deprecated or not ) is also a breach of the protobuf spec.

If a deprecated field is not intented to be used by developers, just omit it in the declaration or keep the property internal/private. In anyway, this field should present in the wire date, and the wire data is not accessible to the user.

@pdvrieze
Copy link
Contributor

@xiaozhikang0916 What I'm considering is the case where the data is read, processed and serialized. In some cases it is semantically valid to retain the (extra) data, in others it is not. It may be possible to make a copy of the container without the "unknown field" data to avoid serializing invalid data, but this may be challenging for deeply nested hierarchies.

As a protocol the protobuf specification can not say anything of program semantics. The semantics are per definition program specific, only the developers of the program can decide what the correct way is to handle unknown data. The protobuf specification only states that fields are preserved during parsing and serialization, this is not about reading in data, doing some application specific stuff and then saving it again, rather it is to do with transformation of data through different paths.

As to the deprecated fields, an example there is that you have a deprecated localtime field, as well as a current utcTime field. In the example the localTime is no longer present in the data type. When then the utcTime is updated the localTime is no longer valid (the time changed), but because the application no longer knows about localtime it cannot update this "unknown field". The only correct behaviour in that case is to not emit unknown (but possibly incorrect) fields.

In other cases, say you want to add security headers to a message, it is perfectly valid to just encapsulate an opaque message and preserve all fields, it depends on the context. What I'm suggesting is a way (either using annotations or as a format option) to configure the behaviour.

@Flavien
Copy link

Flavien commented Jun 4, 2024

This is kind of critical, and should also work for JSON. Most JSON libraries, have first-class support for this (e.g. in .NET)

@pdvrieze
Copy link
Contributor

pdvrieze commented Jun 4, 2024

@Flavien The challenge here is that the serialization library (and its formats) is designed to support serialization and deserialization (like most serialization libraries there is a bias towards the code side, not the wire format, although the bias is small).
Deserialization in this context is seen as reading data in a format (eg. protobuf or Json) and reading that data in as a native datastructure to be used in the program. Serialization is the reverse (writing a readable version of the data structure to a wire/storage format). Handling arbitrary data can be supported, but this is by necessity format (and data structure) specific. In the case of Json you can do it by having JsonObject properties that can contain arbitrary nested Json. Protobuf can do the same thing.
Note that for Json the library provides direct access to the underlying Json parser. I am not sure that is true for Protobuf.

@xiaozhikang0916
Copy link
Contributor Author

It is pretty good to have such feature in more formats, but it needs efforts on every formats.

And if it will be implemented in multiple formats e.g. xml (already have), protobuf (planned in this issue) and json (requested in #2698 and #2701), it is worthy to have some support in the base interface defined in core lib.

@pdvrieze
Copy link
Contributor

pdvrieze commented Jun 5, 2024

@xiaozhikang0916 In principle library support is good, but in this case the only element I can really see is to provide an annotation. The problem is that the actual data type that needs to be used is format dependent, so I don't think this can be supported well in a format/agnostic way (if a generic supertype is used then it can't be encoded - especially by a different format)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants