Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algorithm to reverse a string should account for extended grapheme clusters #299

Closed
2 tasks done
kgryte opened this issue May 14, 2020 · 6 comments
Closed
2 tasks done
Labels
Bug Something isn't working. Help Wanted Extra attention is needed.

Comments

@kgryte
Copy link
Member

kgryte commented May 14, 2020

Checklist

Please ensure the following tasks are completed before filing a bug report.

  • Read and understood the Code of Conduct.
  • Searched for existing issues and pull requests.

Description

Description of the issue.

Encountered an error when attempting to use @stdlib/string/reverse to reverse a string which includes extended grapheme clusters. While the implementation does account for Unicode surrogate pairs and combining marks, extended grapheme clusters do not appear supported.

Addressing this issue may require using a Unicode segmentation algorithm, as discussed in the Unicode standard. One can find reference implementations in Rust and elsewhere.

Related Issues

Does this issue have any related issues?

Related issues #295.

Questions

Any questions for reviewers?

No.

Other

Any other information relevant to this issue? This may include screenshots, references, stack traces, sample output, and/or implementation notes.

N/A

Demo

If relevant, provide a link to a live demo.

N/A

Reproduction

What steps are required to reproduce the unexpected output?

In order to reproduce this bug, do the following:

  • attempt to reverse a string which includes extended grapheme clusters.

Expected Results

What are the expected results?

The reversed string to account for extended grapheme clusters.

Actual Results

What are the actual results?

N/A

Environments

What environments are affected (e.g., Node v0.4.x, Chrome, IE 11)? If Node.js, include the npm version, operating system, and any other potentially relevant platform information.

The following environments are affected:

  • all
@kgryte kgryte added Bug Something isn't working. Help Wanted Extra attention is needed. labels May 14, 2020
@gardhr
Copy link

gardhr commented Jun 18, 2020

Try something like this:


function reverse(str) {
  var result = "";
  var smx = str.length;
  var sdx = 0;
  /*
   Can't start at the end because each segment depends on the 
   previous one. So build up a list of lengths first instead. 
*/
  var counts = [];
  for (;;) {
    var skip = str.codePointAt(sdx) <= 0xffff ? 1 : 2;
    counts.push(skip);
    sdx += skip;
    if (sdx >= smx) break;
  }

  var cmx = counts.length;
  while (cmx--) {
    var skip = counts[cmx];
    /*
   Now we know how much to hop backward for the next grapheme.
   Write the first segment, then the second (if present).
*/
    var hop = skip;
    while (hop) result += str[smx - hop--];

    /*
     Move on to the next glyph. (Bounds-checking won't be needed
     here since the counts array holds an accurate list of widths.)
*/
    smx -= skip;
  }
  return result;
}

@kgryte
Copy link
Member Author

kgryte commented Jun 19, 2020

@gardhr Thanks for the suggestion! Feel free to submit a PR along with tests. :)

Not sure how well your proposal accounts for all graphemes, as it has been a while since I last looked into it.

@gardhr
Copy link

gardhr commented Jun 19, 2020

Sorry, I just don't have the time at the moment. It has worked well for me over the years, I would be rather surprised if it choked on anything at all. If it does however, just give me a set of failing samples and I'll be glad to fix it.

Wish I could do more to help. Maybe once I get passed some of this backlog?

var print = console.log;
var text =
  "外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲\u{2F804}";
var reversed = reverse(text);
print(text);
print();
print(reversed);
print();
print("Test:", reverse(reversed) == text ? "Pass" : "FAIL!");

Output:

外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲你

你♲★☃𠁨𠁎𠀾𠀋хыннартсони 😈 ησηθὰμκΕ ,😭 ილუთრაქ ,😃 學教文語าษาภนอสะลแนยีรเαίλακσαδιΔ ιακ ησηθὰμκΕ ήκισσωλΓhcycbo wókyzęj akuaN기치르가 와기우배말,cọh ựS ,ữgN nôgNaihq biaht mwak suLŋiʃt:it' dnæ ŋin:rɘl' ʒdiwɡŋæl'აბელვაწს ად ალვაწსეშ სინე ილუთრაქицизЕ иджуЧ ан енавадоперП и енавачузИsamoidi ed oidutse y aznañesnE学教文语・學教文語вокызя хыннартсони еинечубо и еинечузИgnihcaeT dna gninraeL egaugnaL授教と習学の語国外

Test: Pass

@kgryte
Copy link
Member Author

kgryte commented Jun 19, 2020

@gardhr Thanks for the demo!

@gardhr
Copy link

gardhr commented Jun 19, 2020

Yep, you're welcome.

ShraddheyaS added a commit to ShraddheyaS/stdlib that referenced this issue Sep 14, 2020
@kgryte
Copy link
Member Author

kgryte commented Nov 11, 2023

This was resolved in #1082.

@kgryte kgryte closed this as completed Nov 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working. Help Wanted Extra attention is needed.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants