-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement WriteStringValueSegment defined in Issue 67337 #101356
base: main
Are you sure you want to change the base?
Implement WriteStringValueSegment defined in Issue 67337 #101356
Conversation
Note regarding the
|
src/libraries/System.Text.Json/src/System/Text/Json/JsonTokenType.cs
Outdated
Show resolved
Hide resolved
...aries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these changes to be useful, the built-in converters for string
and byte[]
would need to be updated to make use of the new APIs. This would involve having the conveter types derive from the internal JsonResumableConverter<T>
which supports async serialization.
@eiriktsarpalis can we split those changes or do you want them to be in this PR as well? |
Are you asking if we can merge the PR without pending feedback being addressed? I don't think so, it's adding API that wasn't approved and the handling of split surrogate pairs/UTF-8 code points needs to be addressed as a matter of functional correctness and security. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating PR status to reflect its current state.
Asking about the converter part of the PR, not the initial API (wasn't referring to all of the feedback). |
Ah yes, that is largely optional for the scope of this PR. (although we should definitely follow up with an implementation once this is merged because that's where you get the most impact from this feature) |
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.String.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs
Outdated
Show resolved
Hide resolved
|
||
private void ClearPartialCodePoint() => PartialCodePointRaw[3] = 0; | ||
|
||
private void WriteInvalidPartialCodePoint() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming by "invalid" in this case you mean to say that it's incomplete? Isn't that already implied by using "partial"? If it actually was invalid, you wouldn't want to write it correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the method body I take it "invalid" means that it requires escaping? Is it possible to determine if a a partial code point requires escaping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller only calls this when the code is invalid, but what this actually does is write replacement characters for the partial code point. By invalid, I mean that the previous write which set the partial code point was, say, UTF-8 and the new one is UTF-16, so this code point needs to be cleared. Maybe FlushPartialCodePoint
or similar is a better name (though Utf8JsonWriter.Flush doesn't flush partial code points, so that could be confusing for a different reason).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably go with something more verbose, e.g. WriteReplacementCharForInvalidPartialCodepoint
...es/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.Values.StringSegment.cs
Show resolved
Hide resolved
...es/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.Values.StringSegment.cs
Outdated
Show resolved
Hide resolved
...es/System.Text.Json/tests/System.Text.Json.Tests/Utf8JsonWriterTests.Values.StringSegment.cs
Outdated
Show resolved
Hide resolved
jsonUtf8.Flush(); | ||
JsonTestHelper.AssertContents("\"", output); | ||
|
||
// Writing empty UTF-16 sequence will dump the partial UTF-8 code point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why that happens. Naively, I would expect that an empty UTF-16 sequence should be a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that. We can special case this to check if the write is empty and non-final then it can be a no-op. One reasoning for the current behavior is if the user is just writing arbitrary strings then it might be confusing when some of them dump the partial code point and some don't until they track it down to the content.
I consider the whole scenario of switching encodings (especially when there's a partial code point) to be "undefined behavior" from a user's perspective. @GrabYourPitchforks mentioned that our other StreamReader/Writer APIs just misbehave when encodings change across operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever encoding changes, we just dump the partial code point regardless of the content/size of what's being written (the empty string here), so this behavior is just consistent with that.
Sure, it just surprises me that this should happen in the event of a no-op call. No-ops should be just that. I'm not super comfortable with this hybrid mode if I'm honest, it seems to be introducing surprising behavior like this one. I would recommend just disabling it and requiring that the user always specifies the same encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fuzzer capable of randomly switching back and forth between encodings for a single string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fuzzer capable of randomly switching back and forth between encodings for a single string?
Yes, the fuzzer just gets an input of bytes that we can interpret any way we want, so we can implement this encoding switching behavior by just passing a span of the bytes into the writer if we want utf8 or marshalling a span of the bytes into chars and passing that into the writer for utf16. Currently the fuzzer test I added splits the data into 3 segments and does test switching encodings but it's easy enough to make it N splits where N is determined by the input.
Sure, it just surprises me that this should happen in the event of a no-op call.
In some sense, even if the default values are passed in for the argument (empty input span, isFinal = false), it still isn't a nop call because there is no default for the encoding and the user needs to select one by choosing the overload that they use.
I would recommend just disabling it and requiring that the user always specifies the same encoding.
Yeah, when encoding switches it will more likely be due to lack of validation or some other error by the caller and we should fail fast instead of silently continuing. If there is a desire for this in the future we should probably make it opt-in through JsonWriterOptions
.
Co-authored-by: Eirik Tsarpalis <[email protected]>
…icator/writestringvaluesegment
…b.com/ificator/dotnet_runtime into user/ificator/writestringvaluesegment
@MihuBot fuzz JsonWriter |
Copilot
AI
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 12 out of 27 changed files in this pull request and generated 1 comment.
Files not reviewed (15)
- src/libraries/Fuzzing/DotnetFuzzing/DotnetFuzzing.csproj: Language not supported
- src/libraries/System.Text.Encodings.Web/src/Resources/Strings.resx: Language not supported
- src/libraries/System.Text.Encodings.Web/src/System.Text.Encodings.Web.csproj: Language not supported
- src/libraries/System.Text.Encodings.Web/tests/System.Text.Encodings.Web.Tests.csproj: Language not supported
- src/libraries/System.Text.Json/src/Resources/Strings.resx: Language not supported
- src/libraries/System.Text.Json/src/System.Text.Json.csproj: Language not supported
- src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs: Evaluated as low risk
- src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs: Evaluated as low risk
- src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.cs: Evaluated as low risk
- src/libraries/System.Text.Json/src/System/Text/Json/ThrowHelper.cs: Evaluated as low risk
- src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Comment.cs: Evaluated as low risk
- src/libraries/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.Escaping.cs: Evaluated as low risk
- src/libraries/System.Text.Json/ref/System.Text.Json.cs: Evaluated as low risk
- src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs: Evaluated as low risk
- eng/pipelines/libraries/fuzzing/deploy-to-onefuzz.yml: Evaluated as low risk
...libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteValues.Helpers.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.cs
Show resolved
Hide resolved
Does this also handle the large base64 byte[] case? |
It does not, I'm planning on doing that as a follow up to this PR. The remaining work for this feature is:
|
@@ -45,4 +45,24 @@ static void Throw(ReadOnlySpan<T> expected, ReadOnlySpan<T> actual) | |||
throw new Exception($"Expected={expected[diffIndex]} Actual={actual[diffIndex]} at index {diffIndex}"); | |||
} | |||
} | |||
|
|||
public static void Throws<T, TState>(Action<TState> action, TState state) | |||
where T : Exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where T : Exception | |
where TException : Exception |
} | ||
catch (T) | ||
{ | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider returning the caught exception, since that's what existing Assert.Throws*
methods typically do.
<value>Cannot extract a Unicode scalar value from the specified index in the input.</value> | ||
</data> | ||
<data name="CannotMixEncodings" xml:space="preserve"> | ||
<value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider updating the wording here (and in the previous messages) to use "not supported" instead of "cannot":
<value>Cannot mix encodings between string value segments. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value> | |
<value>Mixing UTF encodings in a single multi-segment JSON string is not supported. The previous segment's encoding was '{0}' and the current segment's encoding is '{1}'.</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution @ificator and @PranavSenthilnathan!
This is a quick attempt at implementing
WriteStringValueSegment
as defined here:#67337 (comment)