Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement WriteStringValueSegment defined in Issue 67337 #101356

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

ificator
Copy link

This is a quick attempt at implementing WriteStringValueSegment as defined here:
#67337 (comment)

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 21, 2024
Copy link
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these changes to be useful, the built-in converters for string and byte[] would need to be updated to make use of the new APIs. This would involve having the conveter types derive from the internal JsonResumableConverter<T> which supports async serialization.

@davidfowl
Copy link
Member

@eiriktsarpalis can we split those changes or do you want them to be in this PR as well?

@eiriktsarpalis
Copy link
Member

can we split those changes or do you want them to be in this PR as well?

Are you asking if we can merge the PR without pending feedback being addressed? I don't think so, it's adding API that wasn't approved and the handling of split surrogate pairs/UTF-8 code points needs to be addressed as a matter of functional correctness and security.

Copy link
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating PR status to reflect its current state.

@davidfowl
Copy link
Member

Asking about the converter part of the PR, not the initial API (wasn't referring to all of the feedback).

@eiriktsarpalis
Copy link
Member

Ah yes, that is largely optional for the scope of this PR. (although we should definitely follow up with an implementation once this is merged because that's where you get the most impact from this feature)


private void ClearPartialCodePoint() => PartialCodePointRaw[3] = 0;

private void WriteInvalidPartialCodePoint()
Copy link
Member

@eiriktsarpalis eiriktsarpalis Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming by "invalid" in this case you mean to say that it's incomplete? Isn't that already implied by using "partial"? If it actually was invalid, you wouldn't want to write it correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the method body I take it "invalid" means that it requires escaping? Is it possible to determine if a a partial code point requires escaping?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller only calls this when the code is invalid, but what this actually does is write replacement characters for the partial code point. By invalid, I mean that the previous write which set the partial code point was, say, UTF-8 and the new one is UTF-16, so this code point needs to be cleared. Maybe FlushPartialCodePoint or similar is a better name (though Utf8JsonWriter.Flush doesn't flush partial code points, so that could be confusing for a different reason).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably go with something more verbose, e.g. WriteReplacementCharForInvalidPartialCodepoint

jsonUtf8.Flush();
JsonTestHelper.AssertContents("\"", output);

// Writing empty UTF-16 sequence will dump the partial UTF-8 code point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why that happens. Naively, I would expect that an empty UTF-16 sequence should be a no-op.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that. We can special case this to check if the write is empty and non-final then it can be a no-op. One reasoning for the current behavior is if the user is just writing arbitrary strings then it might be confusing when some of them dump the partial code point and some don't until they track it down to the content.

I consider the whole scenario of switching encodings (especially when there's a partial code point) to be "undefined behavior" from a user's perspective. @GrabYourPitchforks mentioned that our other StreamReader/Writer APIs just misbehave when encodings change across operation.

Copy link
Member

@eiriktsarpalis eiriktsarpalis Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that.

Sure, it just surprises me that this should happen in the event of a no-op call. No-ops should be just that. I'm not super comfortable with this hybrid mode if I'm honest, it seems to be introducing surprising behavior like this one. I would recommend just disabling it and requiring that the user always specifies the same encoding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fuzzer capable of randomly switching back and forth between encodings for a single string?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fuzzer capable of randomly switching back and forth between encodings for a single string?

Yes, the fuzzer just gets an input of bytes that we can interpret any way we want, so we can implement this encoding switching behavior by just passing a span of the bytes into the writer if we want utf8 or marshalling a span of the bytes into chars and passing that into the writer for utf16. Currently the fuzzer test I added splits the data into 3 segments and does test switching encodings but it's easy enough to make it N splits where N is determined by the input.

Sure, it just surprises me that this should happen in the event of a no-op call.

In some sense, even if the default values are passed in for the argument (empty input span, isFinal = false), it still isn't a nop call because there is no default for the encoding and the user needs to select one by choosing the overload that they use.

I would recommend just disabling it and requiring that the user always specifies the same encoding.

Yeah, when encoding switches it will more likely be due to lack of validation or some other error by the caller and we should fail fast instead of silently continuing. If there is a desire for this in the future we should probably make it opt-in through JsonWriterOptions.

Co-authored-by: Eirik Tsarpalis <[email protected]>
@MihaZupan
Copy link
Member

@MihuBot fuzz JsonWriter

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 12 out of 27 changed files in this pull request and generated 1 comment.

Files not reviewed (15)
  • src/libraries/Fuzzing/DotnetFuzzing/DotnetFuzzing.csproj: Language not supported
  • src/libraries/System.Text.Encodings.Web/src/Resources/Strings.resx: Language not supported
  • src/libraries/System.Text.Encodings.Web/src/System.Text.Encodings.Web.csproj: Language not supported
  • src/libraries/System.Text.Encodings.Web/tests/System.Text.Encodings.Web.Tests.csproj: Language not supported
  • src/libraries/System.Text.Json/src/Resources/Strings.resx: Language not supported
  • src/libraries/System.Text.Json/src/System.Text.Json.csproj: Language not supported
  • src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs: Evaluated as low risk
  • src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/src/System/Text/Json/ThrowHelper.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Comment.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.Escaping.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/ref/System.Text.Json.cs: Evaluated as low risk
  • src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs: Evaluated as low risk
  • eng/pipelines/libraries/fuzzing/deploy-to-onefuzz.yml: Evaluated as low risk

@davidfowl
Copy link
Member

Does this also handle the large base64 byte[] case?

@PranavSenthilnathan
Copy link
Member

Does this also handle the large base64 byte[] case?

It does not, I'm planning on doing that as a follow up to this PR. The remaining work for this feature is:

  • support base64 in the writer
  • modify the built-in converters
    • byte[] converter to use the base64 segmented api
    • string converter to use the string segmented api

@@ -45,4 +45,24 @@ static void Throw(ReadOnlySpan<T> expected, ReadOnlySpan<T> actual)
throw new Exception($"Expected={expected[diffIndex]} Actual={actual[diffIndex]} at index {diffIndex}");
}
}

public static void Throws<T, TState>(Action<TState> action, TState state)
where T : Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
where T : Exception
where TException : Exception

}
catch (T)
{
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider returning the caught exception, since that's what existing Assert.Throws* methods typically do.

<value>Cannot extract a Unicode scalar value from the specified index in the input.</value>
</data>
<data name="CannotMixEncodings" xml:space="preserve">
<value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider updating the wording here (and in the previous messages) to use "not supported" instead of "cannot":

Suggested change
<value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>
<value>Mixing UTF encodings in a single multi-segment JSON string is not supported. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>

Copy link
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @ificator and @PranavSenthilnathan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Text.Json community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants