Default functions should manage their internal bidirectionality #916

Closed
eemeli opened this issue Oct 27, 2024 · 12 comments
Labels
LDML46.1 MF2.0 Draft Candidate registry Issue pertains to the function registry resolve-candidate This issue appears to have been answered or resolved, and may be closed soon.

Comments

@eemeli
Collaborator

eemeli commented Oct 27, 2024

In Handling Bidirectional Text, we mention this:

If a formatted expression itself contains spans with differing directionality,
its formatter SHOULD perform any necessary processing, such as inserting controls or
isolating such parts to ensure that the formatted value displays correctly in a plain text context.

In the default set of functions, we should add text ensuring that they at least properly isolate their internal values. This is particularly relevant for :currency as proposed in #915 and originally discussed in #315 (comment).

Doing so might also allow for the default bidi isolation strategy to not require isolation for all placeholders in RTL messages.

@eemeli eemeli added the registry Issue pertains to the function registry label Oct 27, 2024
@aphillips aphillips added the LDML46.1 MF2.0 Draft Candidate label Oct 27, 2024
@aphillips
Member

A lot of formatters already use bidi marks, e.g. ICU NumberFormat uses U+200E to coerce the minus sign to appear in the right place or DateFormat uses it to coerce placement with substrings like GMT-7. DateFormat uses U+200F to prevent / in date formats from misbehaving. Etc.
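These marks are invisible in most output, so it helps to make them visible when inspecting what a formatter produced. A minimal sketch, assuming a hypothetical helper name (`showMarks`) and a hand-written sample string rather than real ICU output:

```java
final class BidiMarks {
    /** Replaces the three strong bidi marks with visible labels. */
    static String showMarks(String s) {
        return s.replace("\u200E", "<LRM>")   // LEFT-TO-RIGHT MARK
                .replace("\u200F", "<RLM>")   // RIGHT-TO-LEFT MARK
                .replace("\u061C", "<ALM>");  // ARABIC LETTER MARK
    }

    public static void main(String[] args) {
        // Hand-written sample (an assumption, not captured formatter output):
        // an LRM placed in front of the minus sign.
        System.out.println(showMarks("\u200E-1,234.57")); // prints <LRM>-1,234.57
    }
}
```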

This isn't really isolation. It's just coercion of the directional runs. The example I gave at UTW wouldn't be fixed by interior isolation:

السعر 1,234.56 AED + 12.99 USD الشحن

\u0627\u0644\u0633\u0639\u0631 1,234.56 AED + 12.99 USD \u0627\u0644\u0634\u062d\u0646

You don't want to force the currency symbol to "choose sides" prematurely. The fix is to install the LRI or RLI based on the exterior string direction:

(RLI)
السعر ⁧1,234.56 AED⁩ + ⁧12.99 USD⁩ الشحن

(LRI)
السعر ⁦1,234.56 AED⁩ + ⁦12.99 USD⁩ الشحن
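
A minimal sketch of that exterior wrapping, assuming the caller already knows which direction to apply (for example from the message's locale). The class and method names are hypothetical, and the formatted values are the literal strings from the example above:

```java
final class BidiIsolate {
    static final char LRI = '\u2066'; // LEFT-TO-RIGHT ISOLATE
    static final char RLI = '\u2067'; // RIGHT-TO-LEFT ISOLATE
    static final char PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

    /** Wraps an already formatted value in RLI...PDI or LRI...PDI. */
    static String isolate(String formatted, boolean rtl) {
        return (rtl ? RLI : LRI) + formatted + PDI;
    }

    public static void main(String[] args) {
        // The Arabic words are written as escapes to keep the source ASCII-only.
        String message = "\u0627\u0644\u0633\u0639\u0631 "
                + isolate("1,234.56 AED", true)
                + " + "
                + isolate("12.99 USD", true)
                + " \u0627\u0644\u0634\u062d\u0646";
        System.out.println(message);
    }
}
```

The formatter output itself is untouched here; only the engine-side wrapper changes between the (RLI) and (LRI) variants.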

I agree that it would be nice to not require isolation for all RTL message placeholders, but the problem is detecting when it's needed. Base direction equality isn't it. The presence of opposite direction characters isn't it. The exterior presence of neutrals isn't it. However, a formatter can get "isolate-like" behavior if it includes the right strongly directional marks.

I guess what I'm saying is:

  1. We should amend the Default Bidi Strategy to allow a function to indicate whether to isolate or not.

  2. We should specify that the standard function set take advantage of this.

    Something like:

    The function :number can be implemented in a way that allows formatted values
    to be displayed in different left-to-right and right-to-left contexts without spillover effects.
    When this is done, the :number function SHOULD suppress bidi isolation.

@macchiati
Member

+1

@eemeli
Collaborator Author

eemeli commented Oct 27, 2024

Thinking about this some more, I think we need to be more explicit about what we expect to happen when e.g. u:dir=ltr is used in a message which has a base LTR directionality, in particular with respect to the default bidi strategy.

At the moment, at least my understanding is that it overrides the placeholder's directionality, which means that e.g.

Hello {$name :string u:dir=ltr} 1234

will never have its placeholder isolated because both the message and the placeholder are LTR.

Wouldn't it make sense to instead always isolate a placeholder that has an explicit u:dir? That would probably match user expectations better, as it would effectively work like the dir attribute in HTML.
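
To make the difference concrete, here is a rough sketch of the two decisions as I describe them in this comment. It is not the spec text; the type and method names are hypothetical:

```java
final class IsolationDecision {
    enum Dir { LTR, RTL, UNKNOWN }

    /**
     * Current behavior as described above: an LTR placeholder in an LTR
     * message is the only combination that is left without isolation.
     */
    static boolean defaultStrategy(Dir msgDir, Dir partDir) {
        return !(msgDir == Dir.LTR && partDir == Dir.LTR);
    }

    /** Suggested behavior: an explicit u:dir always forces isolation. */
    static boolean withExplicitDir(Dir msgDir, Dir partDir, boolean hasExplicitDir) {
        return hasExplicitDir || defaultStrategy(msgDir, partDir);
    }

    public static void main(String[] args) {
        // Hello {$name :string u:dir=ltr} 1234, formatted as an LTR message
        System.out.println(defaultStrategy(Dir.LTR, Dir.LTR));       // prints false
        System.out.println(withExplicitDir(Dir.LTR, Dir.LTR, true)); // prints true
    }
}
```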

@aphillips
Member

will never have its placeholder isolated because both the message and the placeholder are LTR.

You're assuming that a string that starts with Latin letters (Hello...) is in English and has an LTR base paragraph direction. But that might not be true. Witness my UTW talk example (HP لابتوب انفي X360 14-ES0033DX 2 في 1، انتل كور i7-1355U...). It doesn't take much to cut down this example to have only strong LTR stuff in it, but a desire to be in Arabic, e.g. HP {$productType} X360 14-ES0033DX {$feature1}, {$featureList :list}, where $productType is لابتوب انفي and the whole string is RTL.

The locale probably determines the base direction, using APIs such as ULocale.isRightToLeft(), even when the message is apparently LTR. We want to isolate a lot because it's difficult for humans, let alone algorithms, to figure out what the right behavior is.
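
A small sketch of why sniffing the content gives the wrong answer for a string like this. It uses the JDK's java.text.Bidi with the first-strong default rather than ICU4J, purely to keep the example dependency-free. The class and method names are hypothetical:

```java
import java.text.Bidi;

final class BaseDirection {
    /** First-strong heuristic: true when the paragraph resolves to an LTR base direction. */
    static boolean firstStrongIsLtr(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).baseIsLeftToRight();
    }

    public static void main(String[] args) {
        // "HP <Arabic product type> X360 14-ES0033DX", Arabic written as escapes.
        String msg = "HP \u0644\u0627\u0628\u062a\u0648\u0628 \u0627\u0646\u0641\u064a X360 14-ES0033DX";
        // The first strong character is the Latin "H", so the heuristic says LTR
        // even though the string is meant to be displayed as RTL Arabic.
        System.out.println(firstStrongIsLtr(msg)); // prints true
    }
}
```

The intended direction has to come from outside the text, which is what a locale-based check provides.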

@eemeli
Collaborator Author

eemeli commented Oct 28, 2024

will never have its placeholder isolated because both the message and the placeholder are LTR.

You're assuming that a string that starts with Latin letters (Hello...) is in English and has an LTR base paragraph direction. But that might not be true.

I mean the specific case where that is true, and we have an LTR message in an LTR locale with a placeholder with u:dir=ltr. With the default bidi strategy, that LTR+LTR combo means that we skip isolation, which I think is not what a user would expect. I now think that every placeholder with an explicit u:dir should be isolated, irrespective of the overall bidi isolation strategy.

@aphillips
Member

What's the specific change we need to make here?

@eemeli
Collaborator Author

eemeli commented Nov 17, 2024

I filed a PR for the isolation part of this, but that doesn't solve all of this. We still have formatters like :currency that are liable to emit LTR content even for RTL locales, in particular with currencyDisplay=code. For example, this message

{42 :currency currency=EUR currencyDisplay=code} 123 456

would have the trailing content display in an incorrect order when formatted in Hebrew or Arabic, if we didn't isolate all placeholders.

So what we have now works, but it's a rather patch-y fix. If we could rely on :currency (and the other default functions) managing their internal directionality and, e.g., in this case including the isolation within their own output, then we would not need the general-level patch and could allow RTL placeholders in RTL messages to not be isolated.

@aphillips
Member

liable to emit LTR content even for RTL locales

We have to be careful here. The formatted string 42.00 EUR is NOT "LTR content" if the formatting locale is RTL. It should be RTL isolated so that the "EUR" part is rendered in a trailing position (to the left in an RTL sentence). The formatted sequence contains a strong LTR run (which confuses FSI/auto schemes) and LTR speakers expect the rendered string to be 42.00 EUR. But actually that's wrong.

You're very correct that it needs isolation to avoid spillover with the 123 456 part of the string.

allow RTL placeholders in RTL messages to not be isolated

This doesn't work. See my UTW presentation [particularly starting at about the 19:30 mark] for why: any LTR content at all can result in spillover. This is especially true when one has formatted values that contain weak and neutral sequences (dates, numbers, currencies...).

IMHO, we're better off over-isolating, with options to turn isolation behavior off, than we are trying to suppress isolation.

@eemeli
Collaborator Author

eemeli commented Nov 17, 2024

I think we're agreed on isolation being required to avoid spillover. But recall the part of our spec that I quote in the first post here:

If a formatted expression itself contains spans with differing directionality,
its formatter SHOULD perform any necessary processing, such as inserting controls or
isolating such parts to ensure that the formatted value displays correctly in a plain text context.

According to that, and what I propose, it ought to be up to :currency, rather than just the bidi isolation strategy, to manage the isolation of its formatted contents in cases such as my example above.

@aphillips
Member

According to that, and what I propose, it ought to be up to :currency, rather than just the bidi isolation strategy, to manage the isolation of its formatted contents in cases such as my example above.

I kind of disagree with your reading.

The function handler isn't that well positioned to manage its exterior bidi isolation. To do so would require information about the context, surrounding characters, paragraph direction, and more. While some handlers might be set up to do this, most should focus on the interior formatting of the value that they are producing. Dates, numbers, currency values, etc. generally produce a mix of weak, strong, and neutral characters that, by themselves, need help.

I wrote a short program (reproduced below; it uses some utilities I have lying around, so some mods are needed to make it compile) to test this for the 12 named date/time/datetime formats and the plain/percent/currency number formats. ICU4J v76 has 69 different locales in which the date formatter or plain number formatter (not currency) inserts a strong bidi mark into these (the total count is 775 times, so it doesn't happen in every format). With isolation, some of these might not need the strong marker, but who cares. The formatter can focus on its interior needs and can be the standard formatter used by the platform. The MF2 engine can apply (or not) isolation around that.

    public static void checkForBidiControlsA() {
        int count = 0; int itemCount = 0;
        Pattern p = Pattern.compile("[\\u200e\\u200f\\u061c]");
        for (Locale locale : sortAllLocales()) {
            boolean found = false;
            for (int ds : new int[] {DateFormat.SHORT, DateFormat.MEDIUM, DateFormat.LONG, DateFormat.FULL}) {
                DateFormat df = DateFormat.getDateInstance(ds, locale);
                String test = df.format(new Date());
                if (p.matcher(test).find()) {
                    System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                    found = true;
                    itemCount++;
                }
                for (int ts : new int[] {DateFormat.SHORT, DateFormat.MEDIUM, DateFormat.LONG, DateFormat.FULL}) {
                    df = DateFormat.getTimeInstance(ts, locale);
                    test = df.format(new Date());
                    if (p.matcher(test).find()) {
                        System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                        found = true;
                        itemCount++;
                    }
                    df = DateFormat.getDateTimeInstance(ds, ts, locale);
                    test = df.format(new Date());
                    if (p.matcher(test).find()) {
                        System.out.println(locale.toLanguageTag() + " " + test + " " + Util.native2ascii(test));
                        found = true;
                        itemCount++;
                    }
                }
            }
            NumberFormat nf = NumberFormat.getInstance(locale);
            String test = nf.format(-1234.569);
            if (p.matcher(test).find()) {
                System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
                found = true;
                itemCount++;
            }
            nf = NumberFormat.getPercentInstance(locale);
            test = nf.format(-1234.569);
            if (p.matcher(test).find()) {
                System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
                found = true;
                itemCount++;
            }
            nf = NumberFormat.getCurrencyInstance(locale);
            test = nf.format(-1234.569);
            if (p.matcher(test).find()) {
                System.out.println(locale.toLanguageTag() + " " + Util.native2ascii(test));
                found = true;
                itemCount++;
            }
            if (found) count++;
        }
        System.out.println(count + " " + itemCount);
    }
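
The two helpers the program relies on (`Util.native2ascii` and `sortAllLocales()`) are not shown. A possible stand-in for each, where the behavior is my assumption based on how they are called and on the escaped output style earlier in this thread, and where the enclosing class name is hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

final class BidiCheckHelpers {
    /**
     * Assumed behavior of Util.native2ascii: keep ASCII as-is and write every
     * other UTF-16 code unit as a backslash-u escape with four hex digits.
     */
    static String native2ascii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    /** Assumed behavior of sortAllLocales(): all JDK locales ordered by language tag. */
    static Locale[] sortAllLocales() {
        Locale[] locales = Locale.getAvailableLocales().clone();
        Arrays.sort(locales, Comparator.comparing(Locale::toLanguageTag));
        return locales;
    }
}
```

Note that this stand-in enumerates the JDK's locales; the original presumably iterated whatever ICU4J reports, so counts may differ.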

@aphillips
Member

The other thing I'll add, which I was kicking around in my head, is whether we should provide a way for users to control the isolation strategy. u:dir is fine for overriding the actual direction, but we might want to allow users to turn the isolation strategy on/off, e.g.:

You have {$n :number u:bidi=$policy} wildebeest.

Where $policy can be always, never, or auto (the default). The challenge is that mostly one wants this for the whole message. It's really an API feature of the MF2 implementation.
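
As an API-level setting, that could look roughly like the sketch below. Everything here is hypothetical: the enum, the method, and the choice of FSI as the wrapper are assumptions for illustration, and the `autoWouldIsolate` flag stands in for whatever the engine's normal strategy decides:

```java
final class BidiPolicyDemo {
    enum BidiPolicy { ALWAYS, NEVER, AUTO }

    static final String FSI = "\u2068"; // FIRST STRONG ISOLATE
    static final String PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

    /**
     * Applies a message-wide policy to one formatted placeholder.
     * autoWouldIsolate is the decision the engine would make on its own.
     */
    static String apply(BidiPolicy policy, String formatted, boolean autoWouldIsolate) {
        boolean isolate = policy == BidiPolicy.ALWAYS
                || (policy == BidiPolicy.AUTO && autoWouldIsolate);
        return isolate ? FSI + formatted + PDI : formatted;
    }

    public static void main(String[] args) {
        // NEVER leaves the value bare regardless of the automatic decision.
        System.out.println(apply(BidiPolicy.NEVER, "1,234", true)); // prints 1,234
    }
}
```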

@eemeli
Collaborator Author

eemeli commented Nov 18, 2024

Ok; that makes sense. Let's not burden formatting functions with any requirements regarding spillover, and trust that RTL content won't be advertised as LTR.

Where $policy can be always, never, or auto (the default). The challenge is that mostly one wants this for the whole message. It's really an API feature of the MF2 implementation.

Yeah, that does not feel like a placeholder-specific toggle. We already consider the formatter's bidirectional isolation strategy to be configurable, and with #942, including u:dir=auto should always isolate the placeholder.

@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Nov 18, 2024

3 participants