Implement WriteStringValueSegment defined in Issue 67337 #101356

ificator · 2024-04-21T20:30:44Z

This is a quick attempt at implementing WriteStringValueSegment as defined here:
#67337 (comment)

dotnet-issue-labeler · 2024-04-21T20:30:50Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

src/libraries/System.Text.Json/src/System/Text/Json/JsonTokenType.cs

src/libraries/System.Text.Json/ref/System.Text.Json.cs

...aries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs

src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs

eiriktsarpalis

For these changes to be useful, the built-in converters for string and byte[] would need to be updated to make use of the new APIs. This would involve having the converter types derive from the internal JsonResumableConverter<T>, which supports async serialization.

davidfowl · 2024-05-17T05:17:28Z

@eiriktsarpalis can we split those changes or do you want them to be in this PR as well?

eiriktsarpalis · 2024-05-17T07:41:20Z

can we split those changes or do you want them to be in this PR as well?

Are you asking if we can merge the PR without pending feedback being addressed? I don't think so: it adds API that wasn't approved, and the handling of split surrogate pairs/UTF-8 code points needs to be addressed as a matter of functional correctness and security.
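To make the split surrogate pair concern concrete, here is a minimal sketch (not code from this PR) of what happens when the two halves of a UTF-16 surrogate pair are transcoded to UTF-8 independently, as a naive per-segment writer might do. It only uses Encoding.UTF8 and Convert.ToHexString; the variable names are illustrative.

```csharp
using System;
using System.Text;

// U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
string emoji = "\uD83D\uDE00";

// Transcoding the whole pair yields the 4-byte UTF-8 sequence F0 9F 98 80.
byte[] whole = Encoding.UTF8.GetBytes(emoji);

// Transcoding each half on its own cannot work: a lone surrogate has no
// UTF-8 form, so Encoding.UTF8 substitutes U+FFFD (EF BF BD) for each half.
byte[] first = Encoding.UTF8.GetBytes(emoji.Substring(0, 1));
byte[] second = Encoding.UTF8.GetBytes(emoji.Substring(1, 1));

Console.WriteLine(Convert.ToHexString(whole));
Console.WriteLine(Convert.ToHexString(first) + Convert.ToHexString(second));
```

The same problem exists on the UTF-8 side when a multi-byte sequence is cut between two segments, which is why the writer needs to carry partial code point state from one segment to the next.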

eiriktsarpalis

Updating PR status to reflect its current state.

davidfowl · 2024-05-17T07:49:12Z

Asking about the converter part of the PR, not the initial API (wasn't referring to all of the feedback).

eiriktsarpalis · 2024-05-17T07:52:46Z

Ah yes, that is largely optional for the scope of this PR (although we should definitely follow up with an implementation once this is merged, because that's where you get the most impact from this feature).

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs

src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs

src/libraries/System.Text.Encodings.Web/tests/SR.cs

src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs

...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs

eiriktsarpalis · 2024-12-20T13:09:25Z

...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs

+
+        private void ClearPartialCodePoint() => PartialCodePointRaw[3] = 0;
+
+        private void WriteInvalidPartialCodePoint()


I'm assuming by "invalid" in this case you mean to say that it's incomplete? Isn't that already implied by using "partial"? If it actually was invalid, you wouldn't want to write it, correct?

Looking at the method body I take it "invalid" means that it requires escaping? Is it possible to determine if a partial code point requires escaping?

The caller only calls this when the partial code point is invalid, but what it actually does is write replacement characters for that partial code point. By invalid, I mean that the previous write, which set the partial code point, used one encoding (say, UTF-8) and the new one uses another (UTF-16), so the partial code point needs to be cleared. Maybe FlushPartialCodePoint or similar is a better name (though Utf8JsonWriter.Flush doesn't flush partial code points, so that could be confusing for a different reason).

I would probably go with something more verbose, e.g. WriteReplacementCharForInvalidPartialCodepoint
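For readers following the naming discussion, the idea of carrying a partial code point between segments and writing a replacement character when it cannot be completed can be sketched as below. This is a simplified UTF-16-only illustration under my own assumptions, not the PR's implementation: pendingHighSurrogate, output, and WriteSegment are hypothetical names, and the real writer stores the partial code point in a small byte buffer and also handles UTF-8 and escaping.

```csharp
using System;
using System.Text;

// Hypothetical state: a high surrogate left over from the previous segment ('\0' = none).
char pendingHighSurrogate = '\0';
var output = new StringBuilder();

void WriteSegment(ReadOnlySpan<char> segment, bool isFinalSegment)
{
    if (pendingHighSurrogate != '\0' && !segment.IsEmpty)
    {
        if (char.IsLowSurrogate(segment[0]))
        {
            // The pair is completed across the segment boundary.
            output.Append(pendingHighSurrogate).Append(segment[0]);
            segment = segment.Slice(1);
        }
        else
        {
            // The partial code point can never be completed: write U+FFFD instead.
            output.Append('\uFFFD');
        }
        pendingHighSurrogate = '\0';
    }

    // Hold back a trailing high surrogate; the next segment may complete it.
    if (!isFinalSegment && !segment.IsEmpty && char.IsHighSurrogate(segment[^1]))
    {
        pendingHighSurrogate = segment[^1];
        segment = segment.Slice(0, segment.Length - 1);
    }

    // Lone surrogates in the middle of a segment are left untouched in this sketch.
    output.Append(segment);

    // Nothing can follow the final segment, so a leftover partial code point is replaced.
    if (isFinalSegment && pendingHighSurrogate != '\0')
    {
        output.Append('\uFFFD');
        pendingHighSurrogate = '\0';
    }
}

WriteSegment("a\uD83D", isFinalSegment: false);
WriteSegment("\uDE00b", isFinalSegment: true);
Console.WriteLine(output.ToString() == "a\uD83D\uDE00b");
```

In this sketch an empty, non-final segment leaves the pending state alone, which matches the "no-ops should be no-ops" expectation raised later in the thread.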

...es/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.Values.StringSegment.cs

eiriktsarpalis · 2024-12-20T15:25:54Z

...es/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.Values.StringSegment.cs

+                jsonUtf8.Flush();
+                JsonTestHelper.AssertContents("\"", output);
+
+                // Writing empty UTF-16 sequence will dump the partial UTF-8 code point


Curious why that happens. Naively, I would expect that an empty UTF-16 sequence should be a no-op.

Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that. We could special-case this so that an empty, non-final write is a no-op. One rationale for the current behavior: if the user is writing arbitrary strings, it might be confusing that some of them dump the partial code point and some don't, until they track the difference down to the content.

I consider the whole scenario of switching encodings (especially when there's a partial code point) to be "undefined behavior" from a user's perspective. @GrabYourPitchforks mentioned that our other StreamReader/Writer APIs just misbehave when encodings change across operations.

Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that.

Sure, it just surprises me that this should happen in the event of a no-op call. No-ops should be just that. I'm not super comfortable with this hybrid mode if I'm honest, it seems to be introducing surprising behavior like this one. I would recommend just disabling it and requiring that the user always specifies the same encoding.

Is the fuzzer capable of randomly switching back and forth between encodings for a single string?

Is the fuzzer capable of randomly switching back and forth between encodings for a single string?

Yes. The fuzzer just gets an input of bytes that we can interpret any way we want, so we can implement encoding switching by passing a span of the bytes into the writer for UTF-8, or by marshalling a span of the bytes into chars and passing that into the writer for UTF-16. Currently, the fuzzer test I added splits the data into 3 segments and does test switching encodings, but it's easy enough to make it N splits, where N is determined by the input.
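The "N splits, where N is determined by the input" idea could look roughly like the sketch below. This is an illustration under assumed conventions, not the fuzzer in this PR: SplitPayload is a hypothetical helper, and using the first input byte to pick a segment count between 1 and 4 is an arbitrary choice. MemoryMarshal.Cast shows how the same bytes can be reinterpreted as chars for the UTF-16 overload.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical layout: the first byte picks the segment count N (1..4); the rest is payload.
static List<byte[]> SplitPayload(ReadOnlySpan<byte> input)
{
    var segments = new List<byte[]>();
    if (input.IsEmpty)
    {
        return segments;
    }

    int count = (input[0] % 4) + 1;
    ReadOnlySpan<byte> payload = input.Slice(1);
    int size = payload.Length / count;

    for (int i = 0; i < count; i++)
    {
        // The last segment absorbs the remainder so no payload byte is dropped.
        int start = i * size;
        int length = i == count - 1 ? payload.Length - start : size;
        segments.Add(payload.Slice(start, length).ToArray());
    }

    return segments;
}

foreach (byte[] segment in SplitPayload(new byte[] { 2, 0x61, 0x00, 0x62, 0x00, 0x63, 0x00 }))
{
    // Reinterpret the raw bytes as UTF-16 code units (a trailing odd byte is ignored).
    ReadOnlySpan<char> asUtf16 = MemoryMarshal.Cast<byte, char>(new ReadOnlySpan<byte>(segment));
    Console.WriteLine($"{segment.Length} bytes -> {asUtf16.Length} chars");
}
```

Because segment boundaries fall at arbitrary byte offsets, this kind of splitting naturally produces cut surrogate pairs and cut UTF-8 sequences, which is the state the fuzzer is meant to exercise.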

Sure, it just surprises me that this should happen in the event of a no-op call.

In some sense, even if the default values are passed in for the arguments (empty input span, isFinal = false), it still isn't a no-op call, because there is no default for the encoding: the user has to select one by choosing which overload to call.

I would recommend just disabling it and requiring that the user always specifies the same encoding.

Yeah, when the encoding switches, it will more likely be due to a lack of validation or some other error by the caller, and we should fail fast instead of silently continuing. If there is a desire for this in the future, we should probably make it opt-in through JsonWriterOptions.
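The fail-fast behavior agreed on here can be sketched as a small state check. This is an assumption-laden illustration rather than the writer's actual code: previousEncoding and BeginSegment are hypothetical names, the encoding is tracked as a plain string for brevity, and throwing InvalidOperationException is my assumption. The message text is the wording suggested later in this review.

```csharp
#nullable enable
using System;

// Hypothetical tracker: null means no segment of the current string has been written yet.
string? previousEncoding = null;

void BeginSegment(string encoding, bool isFinalSegment)
{
    if (previousEncoding is not null && previousEncoding != encoding)
    {
        // Fail fast instead of silently flushing or replacing the partial code point.
        throw new InvalidOperationException(
            "Mixing UTF encodings in a single multi-segment JSON string is not supported. " +
            $"The previous segment's encoding was '{previousEncoding}' and the current segment's encoding is '{encoding}'.");
    }

    // The final segment ends the string, so the next string may use either encoding.
    previousEncoding = isFinalSegment ? null : encoding;
}
```

With this shape, a UTF-8 segment followed by a UTF-16 segment of the same string throws on the second call, while a new string started after a final segment is free to use a different encoding.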

Co-authored-by: Eirik Tsarpalis <[email protected]>

…icator/writestringvaluesegment

…b.com/ificator/dotnet_runtime into user/ificator/writestringvaluesegment

MihaZupan · 2024-12-20T22:54:57Z

@MihuBot fuzz JsonWriter

Copilot reviewed 12 out of 27 changed files in this pull request and generated 1 comment.

Files not reviewed (15)

src/libraries/Fuzzing/DotnetFuzzing/DotnetFuzzing.csproj: Language not supported
src/libraries/System.Text.Encodings.Web/src/Resources/Strings.resx: Language not supported
src/libraries/System.Text.Encodings.Web/src/System.Text.Encodings.Web.csproj: Language not supported
src/libraries/System.Text.Encodings.Web/tests/System.Text.Encodings.Web.Tests.csproj: Language not supported
src/libraries/System.Text.Json/src/Resources/Strings.resx: Language not supported
src/libraries/System.Text.Json/src/System.Text.Json.csproj: Language not supported
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs: Evaluated as low risk
src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs: Evaluated as low risk
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.cs: Evaluated as low risk
src/libraries/System.Text.Json/src/System/Text/Json/ThrowHelper.cs: Evaluated as low risk
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Comment.cs: Evaluated as low risk
src/libraries/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.Escaping.cs: Evaluated as low risk
src/libraries/System.Text.Json/ref/System.Text.Json.cs: Evaluated as low risk
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs: Evaluated as low risk
eng/pipelines/libraries/fuzzing/deploy-to-onefuzz.yml: Evaluated as low risk

src/libraries/Fuzzing/DotnetFuzzing/Fuzzers/Utf8JsonWriterFuzzer.cs

...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs

davidfowl · 2024-12-23T23:32:49Z

Does this also handle the large base64 byte[] case?

PranavSenthilnathan · 2024-12-24T01:21:40Z

Does this also handle the large base64 byte[] case?

It does not; I'm planning on doing that as a follow-up to this PR. The remaining work for this feature is:

- support base64 in the writer
- modify the built-in converters
  - byte[] converter to use the base64 segmented API
  - string converter to use the string segmented API

eiriktsarpalis · 2024-12-24T17:02:19Z

src/libraries/Fuzzing/DotnetFuzzing/Assert.cs

@@ -45,4 +45,24 @@ static void Throw(ReadOnlySpan<T> expected, ReadOnlySpan<T> actual)
            throw new Exception($"Expected={expected[diffIndex]} Actual={actual[diffIndex]} at index {diffIndex}");
        }
    }
+
+    public static void Throws<T, TState>(Action<TState> action, TState state)
+        where T : Exception


Suggested change

-        where T : Exception
+        where TException : Exception

eiriktsarpalis · 2024-12-24T17:03:10Z

src/libraries/Fuzzing/DotnetFuzzing/Assert.cs

+        }
+        catch (T)
+        {
+            return;


Consider returning the caught exception, since that's what existing Assert.Throws* methods typically do.
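Taken together, the two suggestions on this helper (the TException name and returning the caught exception) would give something like the sketch below. It is written as a standalone local function for illustration; in the PR it is a static method in the fuzzing Assert class, and the failure message used here is an assumption.

```csharp
using System;

static TException Throws<TException, TState>(Action<TState> action, TState state)
    where TException : Exception
{
    try
    {
        action(state);
    }
    catch (TException ex)
    {
        // Returning the exception lets callers assert on its message or properties.
        return ex;
    }

    // Assumed failure behavior: report that the expected exception never occurred.
    throw new Exception($"Expected exception of type {typeof(TException).Name} was not thrown.");
}

InvalidOperationException caught = Throws<InvalidOperationException, string>(
    message => throw new InvalidOperationException(message),
    "boom");
Console.WriteLine(caught.Message);
```

Passing the state explicitly, as the original signature does, keeps the delegate free of captured variables, which matters when the state is something like a writer that the fuzzer reuses.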

eiriktsarpalis · 2024-12-24T17:07:08Z

src/libraries/System.Text.Json/src/Resources/Strings.resx

+    <value>Cannot extract a Unicode scalar value from the specified index in the input.</value>
+  </data>
+  <data name="CannotMixEncodings" xml:space="preserve">
+    <value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>


Consider updating the wording here (and in the previous messages) to use "not supported" instead of "cannot":

Suggested change

-    <value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>
+    <value>Mixing UTF encodings in a single multi-segment JSON string is not supported. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value>

eiriktsarpalis

Thanks for your contribution @ificator and @PranavSenthilnathan!

Implement WriteStringValueSegment defined in Issue 67337

9122ef8

dotnet-issue-labeler bot added area-System.Text.Json new-api-needs-documentation labels Apr 21, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 21, 2024

teo-tsirpanis reviewed Apr 21, 2024

src/libraries/System.Text.Json/src/System/Text/Json/JsonTokenType.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

src/libraries/System.Text.Json/ref/System.Text.Json.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

...aries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs Outdated

eiriktsarpalis reviewed Apr 22, 2024

This was referenced Apr 22, 2024

[Test Failure] SocketsHttpHandler_HttpClientHandler_AltSvc_Test_Http3.AltSvc_ConnectionFrame_UpgradeFrom20_Success #101376

Closed

[Test Failure] SocketsHttpHandlerTest_Cookies_Http3.GetAsync_SetCookieContainerAndCookieHeader_BothCookiesSent #101377

Closed

eiriktsarpalis requested changes May 17, 2024

ificator added 3 commits May 26, 2024 14:02

Fix some review comments

e044b13

merge upstream/main

e7abe7f

Handle split surrogate pair

b8d578c

eiriktsarpalis reviewed May 29, 2024

src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs Outdated

eiriktsarpalis reviewed May 29, 2024

src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs Outdated

eiriktsarpalis reviewed May 29, 2024

src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs Outdated

eiriktsarpalis mentioned this pull request Jun 7, 2024

IndexOutOfRangeException serializing large strings #103155

Open