Skip to content
This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Use UTF-32 Aware String in JavaScript #13

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sffc
Copy link

@sffc sffc commented May 22, 2018

As noted in #10, standard JavaScript strings use UTF-16 encoding, which leads to false arithmetic for a tool like diff_match_patch. The only correct fixes to the problem would be to either rewrite the whole library to operate in offsets instead of char indices or to push the character encoding into an abstraction layer. This PR attempts to do the second option.

At every API boundary, the input string is checked to see if it contains supplemental code points, and if it does, the inputs are converted to utf32_string, an abstraction layer that has a String-like API but does all math in terms of code points instead of code units.

The utf32_string implementation is slower than the standard string implementation, but not prohibitively slower, and it is used only when required (when the string contains supplemental code points).

I tried my best to be consistent with the JavaScript style of the existing project, including pre-ES6 syntax. I did not add Closure type annotations.

sffc added 3 commits May 22, 2018 06:59
…s to use functions instead of operators that coerce arguments to primitive types. Checks or converts to utf32_string at the API boundary. Changes unit test suite to be run twice, one with standard string and once with forced utf32_string.
# Conflicts:
#	javascript/diff_match_patch_uncompressed.js
#	javascript/tests/diff_match_patch_test.js


diff_match_patch.utf32_string.prototype.valueOf = function() {
throw new Error("warning: implicit conversion attempted on utf32_string; use toString instead");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

console.error instead of thrown error?

@sffc
Copy link
Author

sffc commented Jun 9, 2018

@NeilFraser Initial thoughts on this?

@sffc
Copy link
Author

sffc commented Aug 31, 2018

@NeilFraser Hello?

@EmilioDiez
Copy link

Hi! thank you for providing the patch.
I have tested this patch and works for me when I use escaped unicode e.g. '\uD800\uDDE4' as you have on your tests. However if I paste the following code on the chrome console

 var dmp = new diff_match_patch();
 var patches = dmp.patch_make('👏test', '👏');
 var patchText = dmp.patch_toText(patches);
 var result = dmp.patch_fromText(patchText);
 var obj = dmp.patch_apply(result, '👏test');

I get this exception:

diff_match_patch.1.0.1.js:2368 Uncaught TypeError: text.codePointAt is not a function
    at Function.diff_match_patch.utf32_string.from (diff_match_patch.1.0.1.js:2368)
    at diff_match_patch.diff_main (diff_match_patch.1.0.1.js:113)
    at diff_match_patch.patch_apply (diff_match_patch.1.0.1.js:1968)
    at <anonymous>:4:17

If you know any workaround around this issue, please could you share it here.

@sffc
Copy link
Author

sffc commented Feb 15, 2019

Probably what's happening is diff_match_patch.utf32_string.from is getting passed text that is already a utf32_string. Try adding a line to the top of that function that leaves if it is already a utf32_string.

  if (text.codePoints) {
    // Argument is another utf32_string; do not re-convert
    return text;
  }

@EmilioDiez
Copy link

EmilioDiez commented Feb 15, 2019

Thank you very much,

The exception is gone 😄 but I still have some Unicode strings that give errors.
To reproduce, paste the following code on the chrome console:

var dmp = new diff_match_patch();
var str = '👏👏t👏'; var patches = dmp.patch_make(str, 'test👏');
var patchText = dmp.patch_toText(patches);
var _patches = dmp.patch_fromText(patchText);
var obj = dmp.patch_apply(_patches, str);
console.log(obj[0]);

it will output this "mojibake"

"tes�t👏"

@sffc
Copy link
Author

sffc commented Feb 15, 2019

Hi, thanks for finding these issues. I don't plan to spend any more time on my PR since it hasn't gotten any attention in 6 months. I encourage you to copy my branch into your own fork (pull it from my fork and push it to your fork) and then add your own commits where you fix these errors. You can then open a new PR and hopefully Neil will find some time one of these days to review it.

@EmilioDiez
Copy link

Ok, no worries 😢 I will report back if I manage to fix the issue. Thank you again for the patch.

@ndvbd
Copy link

ndvbd commented May 24, 2021

Any updates?

@dmsnell
Copy link

dmsnell commented May 24, 2021

@ndvbd you can check out the branch in #80. it doesn't have fixes for all languages but it does for JavaScript. It's also only patched for toDelta since that's the function where we have the issue splitting pairs (since deltas are encoded as UTF-8 with percent-encoding).

We've been using it for more than a year in Simplenote and it's working fine. Previously we had some corruption issues that popped up on account of the issue with surrogate pairs.

@ndvbd
Copy link

ndvbd commented May 25, 2021

@dmsnell any idea why it is not merged into main?

@sffc
Copy link
Author

sffc commented May 25, 2021

Looks like there are now at least 3 independent PRs with various approaches to fixing the issue: this one, #69, and #80. @NeilFraser Can you take a look at these and perhaps merge one?

@dmsnell
Copy link

dmsnell commented May 25, 2021

@ndvbd I would wager that @NeilFraser is busy with other things.

@sffc this approach is fairly heavy-handed - did you run it against the performance tests? what were the costs? How do you feel about converting to UTF-32 vs. the other approaches?

Per the comments in #69 I think that approach sounds better than it is in practice. The solution in #80 only addresses the delta generation (which is "fine" because this library doesn't specify that it produces Unicode byte streams/diffs) and it only imparts a marginal performance penalty and that only when surrogate halves are split. The only thing I see it lacking is the boundary cleanup as seen in the dissimilar because that normalizes all diff outputs. I haven't done that simply because it's working stably enough in #80 and I moved on to other problems.

Your feedback on those issues is welcome. Regardless of what takes place in master I'd like to see us all merge our ideas for resolving this and other encoding-related issues in diff-match-patch.

@sffc
Copy link
Author

sffc commented May 25, 2021

I'm happy if #69 is enough to solve the issue; it's certainly a lot smaller of a patch than this one. There are a lot of code paths; I think the key thing with any of these is a thorough test suite.

@gamedevsam
Copy link

gamedevsam commented Aug 24, 2022

I've tried all 3 solutions to this problem and yours is the only one that solved the encodingURI exception for me. I created a minimal repo which reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to the developers of the other 2 alternate solutions. Thank you for your contributions!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants