Unicode support for word boundary \b
#228
Comments
Unfortunately, emulating Unicode word boundaries would require native lookbehind support, which is only just being added to the JS spec in ECMAScript 2018. When support spreads to all modern browsers, it will be possible to take this on.
No problem, thanks for the explanation @slevithan 👍
@slevithan can this be implemented now? Is this already available?
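Yes, this is possible now in ES2018 environments. But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}* (a letter followed by any combining marks). That leads to the following way to emulate a Unicode-aware word boundary (\b):
(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))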
Or breaking it down with XRegExp-style free spacing and comments to explain it:
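A rough sketch of such a breakdown, written as plain JavaScript string pieces rather than XRegExp free-spacing syntax so it runs without the library. The variable names are illustrative only, and the "word character" definition is the same letter-based approximation assumed above.

```javascript
// Assumption: a "word character" is a letter plus any trailing combining marks
const wordChar = String.raw`\p{L}\p{M}*`;

const unicodeWordBoundary = new RegExp(
  '(?:' +
    // End of a word: a word character behind, none ahead
    `(?<=${wordChar})(?!${wordChar})` +
  '|' +
    // Start of a word: no word character behind, one ahead
    `(?<!${wordChar})(?=${wordChar})` +
  ')',
  'gu' // u is required for \p{...}; g lets replace() visit every boundary
);

// Mark each zero-width boundary with a pipe to make it visible
const marked = 'ааа бб'.replace(unicodeWordBoundary, '|');
console.log(marked); // |ааа| |бб|
```

Both halves are zero-width, so the pattern can be dropped in wherever \b would otherwise go.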
And here's how to emulate a Unicode-aware non-word-boundary (\B):
(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))
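To see how the two emulations relate, here is a small native-RegExp sketch that checks individual string positions. The helper matchesAt is hypothetical (not part of XRegExp), and the letter-only word definition is the same assumption as above. At any position between plain letters and spaces, exactly one of the two patterns should match.

```javascript
// Assumption: a "word character" is a letter plus any trailing combining marks
const w = String.raw`\p{L}\p{M}*`;
const boundary = new RegExp(`(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`, 'u');
const nonBoundary = new RegExp(`(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))`, 'u');

// Hypothetical helper: does the zero-width pattern match exactly at `index`?
function matchesAt(re, str, index) {
  const sticky = new RegExp(re.source, re.flags + 'y');
  sticky.lastIndex = index;
  return sticky.test(str);
}

const sample = 'аб в';
for (let i = 0; i <= sample.length; i++) {
  console.log(i, matchesAt(boundary, sample, i), matchesAt(nonBoundary, sample, i));
}
```

Position 1 (between а and б) is a non-boundary, while positions 2 and 3 (either side of the space) are boundaries.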
If you wanted to add support for Unicode-aware \b to XRegExp (enabled here by a custom flag A):
XRegExp.addToken(
/\\b/,
() => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
{flag: 'A'}
);
Or if you also wanted to support inverse Unicode word boundaries (\B):
XRegExp.addToken(
/\\([bB])/,
(match) => {
const inverse = match[1] === 'B';
return inverse ?
String.raw`(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))` :
String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`;
},
{flag: 'A'}
);
Alternatively, you could avoid overloading the […]. Note that by not specifying a scope for the tokens added, we're using […]. Heads up that this is untested. Also heads up that […]
I don't expect to add built-in support for Unicode word boundaries to XRegExp in the short term, but hopefully the details above are enough to add support within your own code.
Thanks so much for the above code! Would love to get inbuilt support for this in the future!
This did not work for me. Here is my code:
XRegExp = require('xregexp')
base = require('xregexp/lib/addons/unicode-base')
base(XRegExp)
XRegExp.addToken(
/\\b/,
() => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
{flag: 'A'}
);
console.log(XRegExp.exec("ааа бб вв", XRegExp(/\bбб\b/), "uA")) // null
What am I doing wrong?
@mgoldenbe the code works fine, but you are incorrectly passing the flags […]. However, I prepared a long reply about an additional issue based on my initial misreading of your […]:
Your code above is working as intended. See:
XRegExp.addToken(
/\\b/,
() => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
{flag: 'A'}
);
const nativeWordBoundary = /\bXX\b/;
const unicodeLetterBoundary = XRegExp.tag('Au')`\bXX\b`;
nativeWordBoundary.test('愛XX愛'); // true
unicodeLetterBoundary.test('愛XX愛'); // false
unicodeLetterBoundary.test('XX'); // true
However, it seems you missed this from my comment above:
Note that native JS regex word boundaries treat ASCII letters, ASCII numbers, and underscore as "word characters". But above I defined a word character merely as a close approximation of a complete Unicode letter. I did not include any numbers (ASCII or otherwise) or underscore. Based on your above code where you expected the number […]. For example, here's a slight modification of my example code that supports Unicode-aware versions of both \b and \B:
XRegExp.addToken(
/\\([bB])/,
(match) => {
const inverse = match[1] === 'B';
const unicodeLetter = String.raw`\p{L}\p{M}*`;
const unicodeNumber = String.raw`\p{N}`;
const other = '_';
const w = `(?:(?:${unicodeLetter})|(?:${unicodeNumber})|(?:${other}))`;
return inverse ?
`(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))` :
`(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;
},
{flag: 'A'}
);
XRegExp.exec("ааа бб вв", XRegExp.tag('u')`\bбб\b`); // null
XRegExp.exec("ааа бб вв", XRegExp.tag('Au')`\bбб\b`); // ['бб', index: 4, ...]
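For environments where XRegExp is not loaded, the same broader word definition (letters with combining marks, numbers, underscore) can be sketched with a native RegExp. This is an illustration, not the addon itself: wholeWord is a hypothetical helper, and it inserts its argument verbatim, so the argument must not contain regex metacharacters.

```javascript
// Assumed word character: letter (+ combining marks), any Unicode number, or underscore
const w = String.raw`(?:\p{L}\p{M}*|\p{N}|_)`;
// Emulated \b built from that definition
const b = `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;

// Hypothetical helper: match `text` only when it stands as a whole word
function wholeWord(text) {
  return new RegExp(`${b}${text}${b}`, 'u');
}

console.log(/\bбб\b/u.exec('ааа бб вв'));             // null (native \b is ASCII-only)
console.log(wholeWord('бб').exec('ааа бб вв').index); // 4
console.log(wholeWord('бб').test('бб1'));             // false (the digit counts as a word character)
```

The last line is where this definition differs from the letter-only one: with letters alone, the position between бб and 1 would count as a boundary.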
@slevithan Thank you for the detailed reply!
@mgoldenbe there are a couple of potential advantages to using the XRegExp addon above over the solution in that post, especially if you're already including XRegExp in your code:
And you get extra polish for free like erroring when trying to use word boundaries in character classes, support for non-word-boundaries (\B) […]. For many other XRegExp addons (that don't rely on native lookbehind support like this does), XRegExp would also give you the advantage of working in all ES5+ browsers.
Is it possible to extend the unicode support to the word boundary anchor?
For example, a Russian sentence cannot be split:
^ note the split has no effect on Russian
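A minimal sketch of the behavior described, using a made-up sentence (not the one from the original report). The last line shows the kind of result being asked for, using a letter-based emulation of \b that assumes native lookbehind and Unicode property escape support (ES2018).

```javascript
// Hypothetical sample sentences for illustration
const english = 'Hello world';
const russian = 'Привет мир';

// Native \b only knows ASCII word characters, so Cyrillic text has no boundaries
console.log(english.split(/\b/)); // [ 'Hello', ' ', 'world' ]
console.log(russian.split(/\b/)); // [ 'Привет мир' ]

// Emulated boundary, assuming a word character is a letter plus combining marks
const unicodeBoundary = /(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))/u;
console.log(russian.split(unicodeBoundary)); // [ 'Привет', ' ', 'мир' ]
```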
The equivalent and desired behaviour in Ruby, for example: