Use UTF-32 Aware String in JavaScript #13

sffc · 2018-05-22T14:24:50Z

As noted in #10, standard JavaScript strings use UTF-16 encoding, which leads to false arithmetic for a tool like diff_match_patch. The only correct fixes to the problem would be to either rewrite the whole library to operate in offsets instead of char indices or to push the character encoding into an abstraction layer. This PR attempts to do the second option.

At every API boundary, the input string is checked to see if it contains supplemental code points, and if it does, the inputs are converted to utf32_string, an abstraction layer that has a String-like API but does all math in terms of code points instead of code units.

The utf32_string implementation is slower than the standard string implementation, but not prohibitively slower, and it is used only when required (when the string contains supplemental code points).

I tried my best to be consistent with the JavaScript style of the existing project, including pre-ES6 syntax. I did not add Closure type annotations.

…s to use functions instead of operators that coerce arguments to primitive types. Checks or converts to utf32_string at the API boundary. Changes unit test suite to be run twice, one with standard string and once with forced utf32_string.

# Conflicts: # javascript/diff_match_patch_uncompressed.js # javascript/tests/diff_match_patch_test.js

sffc · 2018-05-22T14:26:45Z

javascript/diff_match_patch_uncompressed.js

+
+
+diff_match_patch.utf32_string.prototype.valueOf = function() {
+  throw new Error("warning: implicit conversion attempted on utf32_string; use toString instead");


console.error instead of thrown error?

sffc · 2018-06-09T05:09:42Z

@NeilFraser Initial thoughts on this?

sffc · 2018-08-31T00:55:58Z

@NeilFraser Hello?

EmilioDiez · 2019-02-15T08:15:32Z

Hi! thank you for providing the patch.
I have tested this patch and works for me when I use escaped unicode e.g. '\uD800\uDDE4' as you have on your tests. However if I paste the following code on the chrome console

 var dmp = new diff_match_patch();
 var patches = dmp.patch_make('👏test', '👏');
 var patchText = dmp.patch_toText(patches);
 var result = dmp.patch_fromText(patchText);
 var obj = dmp.patch_apply(result, '👏test');

I get this exception:

diff_match_patch.1.0.1.js:2368 Uncaught TypeError: text.codePointAt is not a function
    at Function.diff_match_patch.utf32_string.from (diff_match_patch.1.0.1.js:2368)
    at diff_match_patch.diff_main (diff_match_patch.1.0.1.js:113)
    at diff_match_patch.patch_apply (diff_match_patch.1.0.1.js:1968)
    at <anonymous>:4:17

If you know any workaround around this issue, please could you share it here.

sffc · 2019-02-15T11:34:55Z

Probably what's happening is diff_match_patch.utf32_string.from is getting passed text that is already a utf32_string. Try adding a line to the top of that function that leaves if it is already a utf32_string.

  if (text.codePoints) {
    // Argument is another utf32_string; do not re-convert
    return text;
  }

EmilioDiez · 2019-02-15T17:17:23Z

Thank you very much,

The exception is gone 😄 but I still have some Unicode strings that give errors.
To reproduce, paste the following code on the chrome console:

var dmp = new diff_match_patch();
var str = '👏👏t👏'; var patches = dmp.patch_make(str, 'test👏');
var patchText = dmp.patch_toText(patches);
var _patches = dmp.patch_fromText(patchText);
var obj = dmp.patch_apply(_patches, str);
console.log(obj[0]);

it will output this "mojibake"

"tes�t👏"

sffc · 2019-02-15T20:21:44Z

Hi, thanks for finding these issues. I don't plan to spend any more time on my PR since it hasn't gotten any attention in 6 months. I encourage you to copy my branch into your own fork (pull it from my fork and push it to your fork) and then add your own commits where you fix these errors. You can then open a new PR and hopefully Neil will find some time one of these days to review it.

EmilioDiez · 2019-02-16T03:03:09Z

Ok, no worries 😢 I will report back if I manage to fix the issue. Thank you again for the patch.

ndvbd · 2021-05-24T18:20:07Z

Any updates?

dmsnell · 2021-05-24T21:35:20Z

@ndvbd you can check out the branch in #80. it doesn't have fixes for all languages but it does for JavaScript. It's also only patched for toDelta since that's the function where we have the issue splitting pairs (since deltas are encoded as UTF-8 with percent-encoding).

We've been using it for more than a year in Simplenote and it's working fine. Previously we had some corruption issues that popped up on account of the issue with surrogate pairs.

ndvbd · 2021-05-25T16:08:56Z

@dmsnell any idea why it is not merged into main?

sffc · 2021-05-25T16:26:38Z

Looks like there are now at least 3 independent PRs with various approaches to fixing the issue: this one, #69, and #80. @NeilFraser Can you take a look at these and perhaps merge one?

dmsnell · 2021-05-25T17:54:47Z

@ndvbd I would wager that @NeilFraser is busy with other things.

@sffc this approach is fairly heavy-handed - did you run it against the performance tests? what were the costs? How do you feel about converting to UTF-32 vs. the other approaches?

Per the comments in #69 I think that approach sounds better than it is in practice. The solution in #80 only addresses the delta generation (which is "fine" because this library doesn't specify that it produces Unicode byte streams/diffs) and it only imparts a marginal performance penalty and that only when surrogate halves are split. The only thing I see it lacking is the boundary cleanup as seen in the dissimilar because that normalizes all diff outputs. I haven't done that simply because it's working stably enough in #80 and I moved on to other problems.

Your feedback on those issues is welcome. Regardless of what takes place in master I'd like to see us all merge our ideas for resolving this and other encoding-related issues in diff-match-patch.

sffc · 2021-05-25T18:10:40Z

I'm happy if #69 is enough to solve the issue; it's certainly a lot smaller of a patch than this one. There are a lot of code paths; I think the key thing with any of these is a thorough test suite.

gamedevsam · 2022-08-24T04:15:31Z

I've tried all 3 solutions to this problem and yours is the only one that solved the encodingURI exception for me. I created a minimal repo which reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to the developers of the other 2 alternate solutions. Thank you for your contributions!

sffc added 3 commits May 22, 2018 06:59

Merge branch 'master' of https://github.com/google/diff-match-patch

7e90d11

# Conflicts: # javascript/diff_match_patch_uncompressed.js # javascript/tests/diff_match_patch_test.js

Reset test success/failure counters for each run.

33dad20

sffc commented May 22, 2018

View reviewed changes

FranklinYu mentioned this pull request May 23, 2018

Continuous Integration #3

Open

josephrocca mentioned this pull request Aug 18, 2019

Diff breaks unicode characters for emojis #59

Open

This was referenced Aug 24, 2022

JS: Handle surrogate pairs correctly #69

Open

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

Open

sffc mentioned this pull request Dec 19, 2022

Inconsistency between iterating over byte and char indices servo/unicode-bidi#86

Open

sffc mentioned this pull request Apr 17, 2024

ICU4C: fixing some date/time format problems unicode-org/conformance#223

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use UTF-32 Aware String in JavaScript #13

Use UTF-32 Aware String in JavaScript #13

sffc commented May 22, 2018

sffc May 22, 2018

sffc commented Jun 9, 2018

sffc commented Aug 31, 2018

EmilioDiez commented Feb 15, 2019

sffc commented Feb 15, 2019

EmilioDiez commented Feb 15, 2019 •

edited

Loading

sffc commented Feb 15, 2019

EmilioDiez commented Feb 16, 2019

ndvbd commented May 24, 2021

dmsnell commented May 24, 2021

ndvbd commented May 25, 2021

sffc commented May 25, 2021

dmsnell commented May 25, 2021

sffc commented May 25, 2021

gamedevsam commented Aug 24, 2022 •

edited

Loading



		diff_match_patch.utf32_string.prototype.valueOf = function() {
		throw new Error("warning: implicit conversion attempted on utf32_string; use toString instead");

Use UTF-32 Aware String in JavaScript #13

Are you sure you want to change the base?

Use UTF-32 Aware String in JavaScript #13

Conversation

sffc commented May 22, 2018

sffc May 22, 2018

Choose a reason for hiding this comment

sffc commented Jun 9, 2018

sffc commented Aug 31, 2018

EmilioDiez commented Feb 15, 2019

sffc commented Feb 15, 2019

EmilioDiez commented Feb 15, 2019 • edited Loading

sffc commented Feb 15, 2019

EmilioDiez commented Feb 16, 2019

ndvbd commented May 24, 2021

dmsnell commented May 24, 2021

ndvbd commented May 25, 2021

sffc commented May 25, 2021

dmsnell commented May 25, 2021

sffc commented May 25, 2021

gamedevsam commented Aug 24, 2022 • edited Loading

EmilioDiez commented Feb 15, 2019 •

edited

Loading

gamedevsam commented Aug 24, 2022 •

edited

Loading