Default functions should manage their internal bidirectionality #916
Comments
A lot of formatters already use bidi marks. For example, ICU NumberFormat uses U+200E to coerce the minus sign to appear in the right place, and DateFormat uses it to coerce placement with certain substrings. This isn't really isolation; it's just coercion of the directional runs. The example I gave at UTW wouldn't be fixed by interior isolation: السعر 1,234.56 AED + 12.99 USD الشحن
You don't want to force the currency symbol to "choose sides" prematurely. The fix is to install the LRI or RLI based on the exterior string direction:
I agree that it would be nice to not require isolation for all RTL message placeholders, but the problem is detecting when it's needed. Base direction equality isn't it. The presence of opposite direction characters isn't it. The exterior presence of neutrals isn't it. However, a formatter can get "isolate-like" behavior if it includes the right strongly directional marks. I guess what I'm saying is:
|
+1 |
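To make the directional runs in the UTW example above easier to see, here is a minimal sketch that feeds the string to the JDK's `java.text.Bidi` and lists each run with its embedding level. The class name `BidiRunInspector` and the helper `describeRuns` are hypothetical names used only for illustration, and the string is written with Unicode escapes so the source stays ASCII. It only inspects levels under first-strong paragraph detection; it does not show how any particular renderer or MF2 implementation would display the text.

```java
import java.text.Bidi;

final class BidiRunInspector {
    // The example string from the thread, spelled with escapes for the Arabic words.
    static final String EXAMPLE =
        "\u0627\u0644\u0633\u0639\u0631 1,234.56 AED + 12.99 USD \u0627\u0644\u0634\u062D\u0646";

    /** Returns one line per directional run (in logical order): level and text. */
    static String describeRuns(String text) {
        // Paragraph direction comes from the first strong character, LTR if none.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            sb.append("level ").append(bidi.getRunLevel(i)).append(": [")
              .append(text, bidi.getRunStart(i), bidi.getRunLimit(i))
              .append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Bidi bidi = new Bidi(EXAMPLE, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println("base LTR? " + bidi.baseIsLeftToRight());
        System.out.println("mixed? " + bidi.isMixed());
        System.out.print(describeRuns(EXAMPLE));
    }
}
```

Looking at where the currency codes and the `+` land in the run list is a quick way to check whether a given set of marks or isolates changes the outcome.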
Thinking about this some more, I think we need to be more explicit about what we expect to happen when e.g. At the moment, at least my understanding is that it overrides the placeholder's directionality, which means that e.g.
will never have its placeholder isolated because both the message and the placeholder are LTR. Wouldn't it make sense to instead always isolate a placeholder that has an explicit |
You're assuming that a string that starts with Latin letters ( The locale probably determines the base direction, using APIs such as |
I mean the specific case where that is true, and we have an LTR message in an LTR locale with a placeholder with |
What's the specific change we need to make here? |
I filed a PR for the isolation part of this, but that doesn't solve all of this. We still have formatters like
would have the trailing content display in an incorrect order when formatted in Hebrew or Arabic, if we didn't isolate all placeholders. So what we have now works, but it's a rather patch-y fix. If we could rely on |
We have to be careful here. The formatted string You're very correct that it needs isolation to avoid spillover with the
This doesn't work. See my UTW presentation [particularly starting at about the 19:30 mark] for why: any LTR content at all can result in spillover. This is especially true when one has formatted values that contain weak and neutral sequences (dates, numbers, currencies...). IMHO, we're better off over-isolating, with options to turn isolation behavior off, than we are trying to suppress isolation. |
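The "over-isolate, with an option to turn it off" approach can be sketched as follows. This is a simplified illustration, not the MF2 specification's algorithm: the class `PlaceholderIsolation`, the enums `Dir` and `Strategy`, and the method `wrap` are all hypothetical names. The only facts assumed are the Unicode isolate controls themselves: LRI (U+2066), RLI (U+2067), FSI (U+2068), and PDI (U+2069).

```java
final class PlaceholderIsolation {
    static final char LRI = '\u2066'; // LEFT-TO-RIGHT ISOLATE
    static final char RLI = '\u2067'; // RIGHT-TO-LEFT ISOLATE
    static final char FSI = '\u2068'; // FIRST STRONG ISOLATE
    static final char PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

    /** Direction reported for a formatted value; AUTO means "unknown". */
    enum Dir { LTR, RTL, AUTO }

    /** Hypothetical engine-level switch: isolate every placeholder, or none. */
    enum Strategy { ISOLATE, NONE }

    /** Wraps an already formatted value; the formatter itself is untouched. */
    static String wrap(String formatted, Dir valueDir, Strategy strategy) {
        if (strategy == Strategy.NONE) {
            return formatted;
        }
        char open = valueDir == Dir.LTR ? LRI
                  : valueDir == Dir.RTL ? RLI
                  : FSI;
        return open + formatted + PDI;
    }

    public static void main(String[] args) {
        String price = wrap("1,234.56 AED", Dir.LTR, Strategy.ISOLATE);
        String shipping = wrap("12.99 USD", Dir.LTR, Strategy.ISOLATE);
        // The isolates are invisible; print lengths to show they were added.
        System.out.println(price.length() + " " + shipping.length());
    }
}
```

The design point is that the wrapping lives in the engine, so a platform formatter can keep whatever interior marks it already emits.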
I think we're agreed on isolation being required to avoid spillover. But recall the part of our spec that I quote in the first post here:
According to that, and what I propose, it ought to be up to |
I kind of disagree with your reading. The function handler isn't that well positioned to manage its exterior bidi isolation. To do so would require information about the context, surrounding characters, paragraph direction, and more. While some handlers might be set up to do this, most should focus on the interior formatting of the value that they are producing. Dates, numbers, currency values, etc. generally produce a mix of weak, strong, and neutral characters that, by themselves, need help.

I wrote a short program (reproduced below; it uses some utilities I have lying around, so some mods are needed to make it compile) to test this for the 12 named date/time/datetime formats and the plain/percent/currency number formats. ICU4J v76 has 69 different locales in which the date formatter or plain number formatter (not currency) inserts a strong bidi mark into its output (775 occurrences in total, so it doesn't happen in every format). With isolation, some of these might not need the strong marker, but who cares. The formatter can focus on its interior needs and can be the standard formatter used by the platform. The MF2 engine can apply (or not) isolation around that.

```java
import com.ibm.icu.text.DateFormat;
import com.ibm.icu.text.NumberFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Pattern;

// Note: sortAllLocales() and Util.native2ascii() are the author's own
// utilities and are not included here.
public static void checkForBidiControlsA() {
    int count = 0;      // locales with at least one hit
    int itemCount = 0;  // total formatted strings containing a mark
    // LRM (U+200E), RLM (U+200F), ALM (U+061C)
    Pattern p = Pattern.compile("[\\u200e\\u200f\\u061c]");
    int[] styles = {DateFormat.SHORT, DateFormat.MEDIUM, DateFormat.LONG, DateFormat.FULL};
    for (Locale locale : sortAllLocales()) {
        boolean found = false;
        for (int ds : styles) {
            // Date-only formats
            DateFormat df = DateFormat.getDateInstance(ds, locale);
            String test = df.format(new Date());
            if (p.matcher(test).find()) {
                System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                found = true;
                itemCount++;
            }
            for (int ts : styles) {
                // Time-only formats
                df = DateFormat.getTimeInstance(ts, locale);
                test = df.format(new Date());
                if (p.matcher(test).find()) {
                    System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                    found = true;
                    itemCount++;
                }
                // Combined date and time formats
                df = DateFormat.getDateTimeInstance(ds, ts, locale);
                test = df.format(new Date());
                if (p.matcher(test).find()) {
                    System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                    found = true;
                    itemCount++;
                }
            }
        }
        // Plain number format
        NumberFormat nf = NumberFormat.getInstance(locale);
        String test = nf.format(-1234.569);
        if (p.matcher(test).find()) {
            System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
            found = true;
            itemCount++;
        }
        // Percent format
        nf = NumberFormat.getPercentInstance(locale);
        test = nf.format(-1234.569);
        if (p.matcher(test).find()) {
            System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
            found = true;
            itemCount++;
        }
        // Currency format
        nf = NumberFormat.getCurrencyInstance(locale);
        test = nf.format(-1234.569);
        if (p.matcher(test).find()) {
            System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
            found = true;
            itemCount++;
        }
        if (found) count++;
    }
    System.out.println(count + " " + itemCount);
}
```
|
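For readers without the author's utilities, the detection step in the program above can be tried with the JDK alone. The sketch below is an assumption-laden stand-in, not the original tool: `BidiMarkScanner`, `hasStrongMark`, and `escapeNonAscii` are hypothetical names, `escapeNonAscii` only approximates what `Util.native2ascii` presumably does, and it uses `java.text.NumberFormat` rather than ICU4J, so the locales that emit marks and the counts may differ from the figures quoted in the comment.

```java
import java.text.NumberFormat;
import java.util.Locale;
import java.util.regex.Pattern;

final class BidiMarkScanner {
    // Same three characters as the pattern in the program above:
    // LRM (U+200E), RLM (U+200F), ALM (U+061C).
    static final Pattern STRONG_MARKS = Pattern.compile("[\\u200e\\u200f\\u061c]");

    /** True if the text contains at least one strongly directional mark. */
    static boolean hasStrongMark(String text) {
        return STRONG_MARKS.matcher(text).find();
    }

    /** Hypothetical stand-in for Util.native2ascii: escapes every non-ASCII char. */
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The locale choice is arbitrary; output depends on the JDK's locale data.
        Locale locale = Locale.forLanguageTag("ar-EG");
        String formatted = NumberFormat.getInstance(locale).format(-1234.569);
        System.out.println(locale.toLanguageTag() + " "
            + hasStrongMark(formatted) + " " + escapeNonAscii(formatted));
    }
}
```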
The other thing I'll add, which I was batting around in my head, is whether we should provide a way for users to control the isolation strategy.
Where |
Ok; that makes sense. Let's not burden formatting functions with any requirements regarding spillover, and trust that RTL content won't be advertised as LTR.
Yeah, that does not feel like a placeholder-specific toggle. We already consider the formatter's bidirectional isolation strategy to be configurable, and with #942 including |
In Handling Bidirectional Text, we mention this:
In the default set of functions, we should add text ensuring that they at least properly isolate their internal values. This is particularly relevant for :currency as proposed in #915 and originally discussed in #315 (comment). Doing so might also allow for the default bidi isolation strategy to not require isolation for all placeholders in RTL messages.