Unicode support for word boundary `\b` #228

gondalez · 2018-03-09T02:20:55Z

Is it possible to extend the unicode support to the word boundary anchor?

For example, this Russian sentence cannot be split:

"hello there this is a test".split(XRegExp('\\b', 'A'))
(11) ["hello", " ", "there", " ", "this", " ", "is", " ", "a", " ", "test"]

"Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(XRegExp('\\b', 'A'))
["Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!"]

^ Note that the split has no effect on the Russian text.

The equivalent, desired behaviour in Ruby, for example:

irb(main):001:0> "hello there this is a test".split(/\b/)
[
  "hello",
  " ",
  "there",
  " ",
  "this",
  " ",
  "is",
  " ",
  "a",
  " ",
  "test"
]
irb(main):002:0> "Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(/\b/)
[
  "Сняли",
  " ",
  "не",
  " ",
  "первый",
  " ",
  "раз",
  " ",
  "изначальную",
  " ",
  "и",
  " ",
  "конечную",
  " ",
  "сумму",
  " ",
  "и",
  " ",
  "начальную",
  " ",
  "не",
  " ",
  "вернули",
  " !!!"
]


slevithan · 2018-03-09T05:10:10Z

Unfortunately, emulating Unicode word boundaries would require native lookbehind support, which is only just being added to the JS spec in ECMAScript 2018. When support spreads to all modern browsers, it will be possible to take this on.

gondalez · 2018-03-12T00:15:30Z

No problem, thanks for the explanation @slevithan 👍

gausie · 2021-03-31T10:44:11Z

@slevithan can this be implemented now? Is this already available?

slevithan · 2021-03-31T21:30:01Z

Yes, this is possible now in ES2018 environments.

But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.

That leads to the following way to emulate a Unicode-aware word boundary (\b):

(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))

Or breaking it down with XRegExp-style free spacing and comments to explain it:

# Either:
(?:
  # The position is preceded by a Unicode word character
  (?<= \p{L}\p{M}* )
  # And the same position is not followed by a Unicode word character
  (?!  \p{L}\p{M}* )
# Or:
|
  # The position is not preceded by a Unicode word character
  (?<! \p{L}\p{M}* )
  # And the same position is followed by a Unicode word character
  (?=  \p{L}\p{M}* )
)
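To try the pattern without XRegExp, here is a minimal sketch using a native regex literal. It assumes an ES2018+ engine with lookbehind support, and it needs flag u so that \p{...} is interpreted as a Unicode property escape. The name letterBoundary is only illustrative.

```javascript
// Assumes an ES2018+ engine (lookbehind, and \p{...} enabled by flag u).
// `letterBoundary` is an illustrative name, not part of XRegExp.
const letterBoundary =
  /(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))/u;

// Split at every position where a letter meets a non-letter.
const parts = 'Сняли не первый раз'.split(letterBoundary);
console.log(parts);
// → ['Сняли', ' ', 'не', ' ', 'первый', ' ', 'раз']
```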

And here's how to emulate a Unicode-aware non-word-boundary (\B):

(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))

If you wanted to add support for Unicode aware \b to XRegExp and hide it behind XRegExp's existing A (astral) flag, you could do the following:

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);

Or if you also wanted to support inverse Unicode word boundaries (\b and \B):

XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    return inverse ?
      String.raw`(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))` :
      String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`;
  },
  {flag: 'A'}
);

Alternatively, you could avoid overloading the A flag and instead give this handling its own flag, such as b. That would just require changing {flag: 'A'} to {flag: 'b'} in the code above.

Note that by not specifying a scope for the tokens added, we're using default scope. That means that \b and \B will only be transformed when they are used outside of character classes ([...]). This is intentional, since \b has a different meaning within character classes in standard JS (it matches a backspace character), and \b or \B within character classes is an error in XRegExp.
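As a quick illustration of why the default scope matters, this sketch shows the two native meanings of \b in plain JS (no XRegExp involved); the variable names are only for illustration.

```javascript
// Outside a character class, \b is a zero-width word boundary assertion.
const boundary = /\bcat\b/;
// Inside a character class, [\b] matches a literal backspace (U+0008).
const backspace = /[\b]/;

console.log(boundary.test('a cat sat'));   // true
console.log(backspace.test('a\bcat'));     // true: the string contains U+0008
console.log(backspace.test('a cat sat'));  // false: no backspace character
```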

Heads up that this is untested. Also heads up that \p{...} doesn't have the intended meaning in ES2018 native regexes unless using flag u, so after adding the above XRegExp tokens you'd have to use flags A and u with your regex to make it work (e.g., XRegExp.tag('Au')`\b` or XRegExp(String.raw`\b`, 'Au')). That's fine if you always remember to use both, but there are two ways you could further improve that to avoid the problem if you forget:

Make it an error to use \b or \B with flag A unless flag u is also present (by checking for flag u within the token handler function shown above, and throwing an error if it's not present).
Use XRegExp.addToken's reparse option. This will lead to XRegExp handling/parsing the generated \p{L}\p{M} tokens in the output, rather than deferring to native syntax. That should resolve the issue since XRegExp doesn't need flag u to transform \p{...} tokens into syntax supported by native regexes (with or without flag u).
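The first option could look roughly like the sketch below. It relies on XRegExp passing the regex's flags as the third argument to token handlers (after the match array and the scope). The function name unicodeBoundaryHandler and the error message are made up for illustration, and the registration call is left commented out so the snippet runs without the package installed.

```javascript
// Hypothetical token handler. XRegExp calls handlers with the match array,
// the scope ('default' or 'class'), and the flags string of the regex.
function unicodeBoundaryHandler(match, scope, flags) {
  // Refuse to emit \p{...} syntax that would be misread without flag u.
  if (!flags.includes('u')) {
    throw new SyntaxError(`\\${match[1]} with flag A also requires flag u`);
  }
  const w = String.raw`\p{L}\p{M}*`;
  return match[1] === 'B' ?
    `(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))` :
    `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;
}

// Registration, assuming XRegExp is loaded:
// XRegExp.addToken(/\\([bB])/, unicodeBoundaryHandler, {flag: 'A'});
```

With this in place, a pattern using \b with flag A alone would fail at construction time instead of silently producing a regex that never matches as intended.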

I don't expect to add built-in support for Unicode word boundaries to XRegExp in the short term, but hopefully the details above are enough to add support within your own code.

OultimoCoder · 2023-06-26T09:05:15Z

Thanks so much for the above code! Would love to get inbuilt support for this in the future!

mgoldenbe · 2024-01-30T11:59:40Z

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);

This did not work for me. Here is my code:

XRegExp = require('xregexp')
base = require('xregexp/lib/addons/unicode-base')
base(XRegExp)
XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
console.log(XRegExp.exec("ааа бб вв", XRegExp(/\bбб\b/), "uA")) // null

What am I doing wrong?

slevithan · 2024-01-31T19:02:07Z

@mgoldenbe the code works fine but you are incorrectly passing "uA" as a third argument to XRegExp.exec rather than as the second (flags) argument to the XRegExp constructor.

However, I prepared a long reply about an additional issue based on my initial misreading of your б characters (U+0431, Cyrillic Small Letter Be) as sixes. So I'll go ahead and include it below even though you might not need it.

Your code above is working as intended. See:

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
const nativeWordBoundary = /\bXX\b/;
const unicodeLetterBoundary = XRegExp.tag('Au')`\bXX\b`;

nativeWordBoundary.test('愛XX愛'); // true
unicodeLetterBoundary.test('愛XX愛'); // false
unicodeLetterBoundary.test('XX'); // true

However, it seems you missed this from my comment above:

But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.

Note that native JS regex word boundaries treat ASCII letters, ASCII numbers, and underscore as "word characters". But above I defined a word character merely as a close approximation of a complete Unicode letter. I did not include any numbers (ASCII or otherwise) or underscore.

Based on your above code where you expected the number "6" to be treated as a word character, I'm guessing this was not the definition of "word character" you were looking for. You can change it to anything you want while following the overall code in my comment.
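To make the difference between the two definitions concrete, here is a small native-JS sketch (assuming an ES2018+ engine). It compares native \b with the letter-only boundary on a string that mixes letters and digits. The name lettersOnly is illustrative.

```javascript
// Letter-only boundary from earlier in the thread, as a native regex.
const lettersOnly =
  /(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))/u;

// Native \b: letters, digits, and underscore are all word characters,
// so there is no boundary inside "abc123".
console.log('abc123'.split(/\b/));          // ['abc123']

// Letter-only definition: digits are not word characters,
// so a boundary appears between "c" and "1".
console.log('abc123'.split(lettersOnly));   // ['abc', '123']
```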

For example, here's a slight modification of my example code that supports Unicode-aware versions of both \b and \B (behind flags 'Au') and that treats any Unicode letter, Unicode number, or underscore as a word character:

XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    const unicodeLetter = String.raw`\p{L}\p{M}*`;
    const unicodeNumber = String.raw`\p{N}`;
    const other = '_';
    const w = `(?:(?:${unicodeLetter})|(?:${unicodeNumber})|(?:${other}))`;
    return inverse ?
      `(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))` :
      `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;
  },
  {flag: 'A'}
);

XRegExp.exec("ааа бб вв", XRegExp.tag('u')`\bбб\b`); // null
XRegExp.exec("ааа бб вв", XRegExp.tag('Au')`\bбб\b`); // ['бб', index: 4, ...]

mgoldenbe · 2024-01-31T21:01:04Z

@slevithan Thank you for the detailed reply!
In the meantime, I discovered this post. I am wondering whether there is any advantage (other than the aesthetic pleasantness of \b) to using XRegExp compared to the plain JS solutions there.

slevithan · 2024-01-31T21:34:01Z

@mgoldenbe there are a couple potential advantages to using the XRegExp addon above over the solution in that post, especially if you're already including XRegExp in your code:

You can share/reuse your regex patterns with other programming languages that also use Unicode-aware \b and \B.
You can freely use Unicode-aware word boundaries in all patterns rather than going through complicated concatenation or function calls to build each regex when you need it (i.e., aesthetic pleasantness at scale).

And you get extra polish for free like erroring when trying to use word boundaries in character classes, support for non-word-boundaries (\B), and pattern caching for better performance.

For many other XRegExp addons (that don't rely on native lookbehind support like this does), XRegExp would also give you the advantage of working in all ES5+ browsers.
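For comparison, a plain-JS approach along the lines described above might look like this sketch (assuming an ES2018+ engine). The helper boundarySources and its return shape are hypothetical, not taken from the linked post or from XRegExp. It shows the concatenation step that has to be repeated for every regex.

```javascript
// Hypothetical helper: builds \b and \B equivalents from a word-character pattern.
function boundarySources(wordChar) {
  const w = `(?:${wordChar})`;
  return {
    b: `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`,
    B: `(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))`,
  };
}

// Word character: Unicode letter (plus combining marks), Unicode number, or underscore.
const {b, B} = boundarySources(String.raw`\p{L}\p{M}*|\p{N}|_`);

// Every regex must be assembled by concatenation, and flag u is mandatory.
const wholeWord = new RegExp(`${b}бб${b}`, 'u');
const insideWord = new RegExp(`${B}бб${B}`, 'u');

console.log(wholeWord.exec('ааа бб вв').index); // 4
console.log(wholeWord.test('аббв'));            // false
console.log(insideWord.test('аббв'));           // true
```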

slevithan closed this as completed Mar 9, 2018

aero31aero mentioned this issue Jul 9, 2018

bugdown: Realm filters do not match patterns with leading non-whitespace characters zulip/zulip#9883

Closed

slevithan reopened this Mar 31, 2021

kometenstaub mentioned this issue Jan 13, 2022

Place syntax around multi word links and tags chrisgrieser/obsidian-smarter-md-hotkeys#17

Closed

SCWR mentioned this issue Apr 18, 2022

Firefox until 78 support \p, Is there any good solution? #343

Closed

mgoldenbe mentioned this issue Jan 28, 2024

Unicode support for \b revisited #361

Closed
