Re-generate hyphenation files from source and add new languages #2102

Merged
merged 25 commits into from
Oct 1, 2024


Omikhleia
Member

@Omikhleia Omikhleia commented Sep 7, 2024

The first commit splits the hyphenation patterns out of the main language logic so data and code are separate. I did it manually and checked each file one by one.

The second commit re-generates most of these patterns from the original TeX sources, with a conversion script. I also checked each file manually (having split them previously makes the diffs more readable).
The (very naive) conversion script is included, as well as the original TeX sources:

  • For comparison and easier re-generation if they are later updated (or we update the naive conversion script)
  • With clear indication of their origin and original license (we were not really playing nice here...)
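To illustrate the kind of conversion described above, here is a minimal sketch of a naive TeX pattern extractor. It is written in Python purely for illustration and is not the script from this PR (which is Lua-based); the function names `strip_comments`, `extract_block` and `parse_tex_hyphenation` are hypothetical. It assumes the source only uses flat `\patterns{...}` and `\hyphenation{...}` blocks with `%` comments, which is likely too simple for some real pattern files.

```python
import re


def strip_comments(tex: str) -> str:
    # Drop everything from an unescaped '%' to the end of each line.
    return "\n".join(re.sub(r"(?<!\\)%.*", "", line) for line in tex.splitlines())


def extract_block(tex: str, macro: str) -> list[str]:
    # Collect the whitespace-separated entries of every \macro{...} block.
    # Assumption: the block body contains no nested braces.
    entries = []
    for body in re.findall(r"\\" + macro + r"\s*\{([^{}]*)\}", tex):
        entries.extend(body.split())
    return entries


def parse_tex_hyphenation(tex: str) -> dict:
    clean = strip_comments(tex)
    return {
        "patterns": extract_block(clean, "patterns"),
        "exceptions": extract_block(clean, "hyphenation"),
    }


# Toy input, not a real pattern file.
sample = r"""
% Toy header comment
\patterns{
.ab1c % trailing comment
2de.
}
\hyphenation{
ta-ble
}
"""
result = parse_tex_hyphenation(sample)
```

Keeping the extraction separate from the output formatting would make it easier to re-run the conversion when the upstream sources change.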

Subsequent commits are:

  • 10 new languages when it was easy: Friulan, Occitan, Upper Sorbian, Telugu, Pali, Galician, Church/Old Slavonic, Albanian, Macedonian, Belarusian. These were not tested and these languages do not have i18n localization strings (fluent), but at least it makes it easier for people interested in these languages to improve upon them.
  • A fix for some missing patterns in polytonic Greek.

Notes:

  • Some of our patterns are not derived from the TeX sources, for several reasons we'd need to discuss separately. (Some of them are expected, e.g. we modified the Portuguese support; some might cause breaking changes, e.g. lots of new patterns in English; others have too many differences and likely come from some other sources, presumably OpenOffice, although the commit history is unclear...) = I've kept these pattern files unchanged for now...
  • I haven't checked how to disable the "typos" checker on the whole hyphens directory, if need be.
  • The converter also extracts the "hyphenmins" pseudo-comments so we can eventually have a path toward issue #2017 ("Hyphenation minimun left/right constraints should be language-specific").
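As a rough idea of what extracting the "hyphenmins" pseudo-comments could look like, here is a small Python sketch. It is an assumption-laden illustration, not the converter from this PR: it assumes the values sit in an indented, YAML-like block inside `%` comment lines at the top of the file, and the function name `extract_hyphenmins` and the toy header are hypothetical.

```python
def extract_hyphenmins(tex: str) -> dict:
    mins = {}
    base_indent = None  # indentation of the 'hyphenmins:' key, once found
    section = None      # current sub-key, e.g. 'typesetting'
    for raw in tex.splitlines():
        if not raw.startswith("%"):
            if base_indent is not None:
                break  # the comment header ended while inside the block
            continue
        line = raw[1:]  # drop the leading '%'
        text = line.strip()
        indent = len(line) - len(line.lstrip())
        if base_indent is None:
            if text == "hyphenmins:":
                base_indent = indent
            continue
        if not text:
            continue
        if indent <= base_indent:
            break  # the next top-level key closes the block
        key, _, value = text.partition(":")
        value = value.strip()
        if value == "":
            section = key
            mins[section] = {}
        elif section is not None and value.isdigit():
            mins[section][key] = int(value)
    return mins


# Toy header (hypothetical), shaped like the assumed pseudo-comment layout.
sample_header = r"""
% title: Toy header
% hyphenmins:
%     typesetting:
%         left: 2
%         right: 3
% other:
%     key: value
\patterns{
}
"""
hyphenmins = extract_hyphenmins(sample_header)
```

The resulting left/right values could then be stored next to the generated patterns for each language.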

@Omikhleia Omikhleia added bug Software bug issue refactor Code quality improvements labels Sep 7, 2024
@Omikhleia Omikhleia requested review from a team and alerque as code owners September 7, 2024 23:40
@Omikhleia Omikhleia changed the title Redo hyphenation Re-generate hyphenation files from source and add new languages Sep 7, 2024
@Omikhleia Omikhleia added the enhancement Software improvement or feature request label Sep 7, 2024
@Omikhleia
Member Author

Omikhleia commented Sep 8, 2024

Let's check where we are vs. TeX hyphenation patterns:

  • Things that could be added = add them to this PR
    • Coptic (cop)
    • Interlingua (ia)
    • Kurmanji (northern Kurdish, kmr)
    • EDIT: "grc" (Ancient greek) see below
  • Things where we have our own upgrades, so likely won't change?
    • Turkish (tr) : many changes by @alerque
    • Esperanto (eo): solution by @ctrlcctrlv differing from what TeX currently has
  • Things that might have to wait for a better implementation of BCP47...
    • en-GB / en-US (but see also just below, last item)
    • zh-Latn-pinyin
    • mul-ethi (Ethiopic)
    • nb/nn variants (Norsk, we do have them though a refactor could be neat eventually)
    • several la variants (liturgic/classic Latin)
    • mn (Mongolian in Cyrl script or Cyrl-x-lmc for Xalx Mongolian variant)
    • sh (Cyrl vs. Latn, Serbo-Croatian, now deprecated?)
  • Things I am not sure of...
    • our "el" vs. TeX's "grc" (I am not sure what the intents are there). EDIT: See further comments below; we are probably using the wrong patterns in SILE's "el".
  • Things where we differ but may change for better
    • pt (Portuguese): We have @jodros 's updates, but an incoming PR might improve upon it: additional set of rules to enhancing TeX hyphenation rules for Portug… hyphenation/tex-hyphen#62 Should we wait or go for an early adoption? = 0.16.x or earlier ?
    • es (Spanish): Lots of differences with our patterns = 0.16.x ?
    • th (Thai): Lots of differences with our patterns, and no idea where the latter come from. = (?)
    • bg (Bulgarian) = Lots of differences with our patterns, and no idea where the latter come from. = 0.16.x ? Note the TeX patterns were updated in 2017 so whatever we had might be outdated (?)
    • de (German) = We have something, TeX has several variants depending on orthography reforms, frankly I don't know.
    • en (English, see also above): Currently we are based on en-US, but we lack a whole set of "additional patterns" that were not present in the original TeX file and were added later (though at an earlier stage than our import). My understanding is that "old" TeX being memory constrained, these were absent from the original implementation. But it's pretty unclear why we don't have the extra patterns... = 0.16.x ? as it could make rendered documents different...

@Omikhleia Omikhleia added this to the v0.15.6 milestone Sep 8, 2024
@DavidLRowe
Contributor

"el" is ISO 639 code for modern Greek (1453-present). "grc" is ISO 639 code for ancient Greek (prior to 1453). No idea how that relates to your hyphenation files. :)

@Omikhleia
Member Author

Omikhleia commented Sep 9, 2024

"el" is ISO 639 code for modern Greek (1453-present). "grc" is ISO 639 code for ancient Greek (prior to 1453). No idea how that relates to your hyphenation files. :)

Indeed. But I don't know the origin of the current "el" patterns in SILE. It's almost the same file as TeX's "grc" (indeed Ancient Greek): there are a bunch of differences here and there, but most of the content is identical... And it's totally different from TeX's el-monoton and el-polyton (both Modern Greek)... So I can't say what it is supposed to be ;)

@Omikhleia
Member Author

Omikhleia commented Sep 9, 2024

It seems SILE's "el" patterns were added by @simoncozens in c6e8d0b ("Oops, forgot to include these") directly on master in Feb. 2014. The TeX "grc" patterns were updated in May 2016 ("added support for curly beta") and, after checking more closely, I confirm that this update accounts for our differences with it.
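For this kind of check, comparing two pattern lists as sets shows quickly what differs. Here is a minimal Python sketch with a hypothetical `compare_patterns` helper; the toy patterns are made up and are not the actual Greek data.

```python
def compare_patterns(ours, upstream):
    # Treat both pattern lists as sets so ordering differences are ignored.
    ours_set, upstream_set = set(ours), set(upstream)
    return {
        "only_ours": sorted(ours_set - upstream_set),
        "only_upstream": sorted(upstream_set - ours_set),
        "shared": len(ours_set & upstream_set),
    }


# Made-up patterns, for illustration only.
ours = ["1a", "2b1c", ".d2e"]
upstream = ["1a", "2b1c", ".d2e", "3f"]
report = compare_patterns(ours, upstream)
```

If the only differences are on the upstream side and match a known upstream change, that supports the idea that one file is an older copy of the other.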

So my assumption is that Simon added the "grc" patterns of that time as "el" -- and as you noted @DavidLRowe , this was likely wrong. It should have been kept as "grc", for Ancient Greek (→ added to my first comment above, we could safely add it too under that name).

As for "el", we should likely alias it to "el-monoton": It seems to me that Modern Greek is monotonic in most contexts since the 1982 reform. (We should state it in the documentation too...) = Strictly speaking, we should postpone it to 0.16.x as it could make rendered documents different?

Conclusion: I'll move on to adding 3 pattern files (cop, ia, kmr; not grc, see the EDIT below) to this PR. The other points need a dedicated issue...

EDIT: Doh. Except that the boustrophedon package would kill any standard "grc". Oh well, Ancient Greek will have to wait then, for a fix of its own.

@Omikhleia Omikhleia marked this pull request as draft September 9, 2024 17:29
@Omikhleia Omikhleia marked this pull request as ready for review September 9, 2024 18:25
@Omikhleia Omikhleia self-assigned this Sep 22, 2024
@alerque
Member

alerque commented Oct 1, 2024

Along with this PR, can you post the script you used for these conversions? I'd like to set up the tooling eventually so we can easily check these for updates. I thought I saw your script (Lua based?) somewhere but I don't see it linked here. I would probably start by bunging it into build-aux/ for now, and we can work out how to automate this a bit later.

Never mind, I see you already have it posted here, I just missed it. I may re-organize where it is but all the parts seem to be here already.

@alerque alerque requested a review from a team as a code owner October 1, 2024 15:44
@alerque
Member

alerque commented Oct 1, 2024

I rehashed the TeX pattern transpiler tooling. The sources are no longer in this branch because I'm not sure we should be distributing them, but I kept them in a branch until we figure that bit out. I updated the importer to read them from the other branch (or any branch of your choice): ./build-aux/import-tex-hyphens.sh <branch_name>. The parser is also a little different, just to handle working on STDIN/STDOUT instead of the Lua having to know about where to find files to read and write. It's also normalizing the output with our Lua code styling rules.

Sadly it still isn't 100% deterministic because Lua does not guarantee the order in which it outputs non-numeric table keys. That should probably get fixed. I'd also like to refactor it in a way that the comments could be imported too, but we have to start somewhere and this is better than keeping our old patterns just to have the TeX code comments too.
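One common way to get stable output is to sort the keys before serializing, since iteration order over non-numeric keys in a Lua table is unspecified. Below is a minimal Python sketch of that idea, emitting Lua table syntax; the `lua_value` helper is hypothetical, assumes keys are valid Lua identifiers, and is not the tooling in this PR. In the Lua script itself the equivalent would presumably be to collect the keys, `table.sort` them, and iterate over the sorted list.

```python
def lua_value(value, depth=1, indent="   "):
    pad = indent * depth
    close = indent * (depth - 1)
    if isinstance(value, dict):
        # Sorting the keys is what makes the output stable across runs.
        # Assumption: every key is a valid Lua identifier.
        items = [
            f"{pad}{key} = {lua_value(value[key], depth + 1, indent)},"
            for key in sorted(value)
        ]
        return "{\n" + "\n".join(items) + "\n" + close + "}"
    if isinstance(value, (list, tuple)):
        # Sequences keep their original order.
        items = [f"{pad}{lua_value(item, depth + 1, indent)}," for item in value]
        return "{\n" + "\n".join(items) + "\n" + close + "}"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    # Everything else is emitted as a double-quoted Lua string.
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


# Same content, different insertion order: the serialized text is identical.
first = lua_value({"patterns": ["1a", "2b"], "hyphenmins": {"right": 3, "left": 2}})
second = lua_value({"hyphenmins": {"left": 2, "right": 3}, "patterns": ["1a", "2b"]})
```

With stable key order, regenerating the files from unchanged sources should produce no diff, which makes upstream updates easier to review.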

I also moved everything back into the /languages namespace because I couldn't see a good reason to have it as an entirely separate class of source files. I am also planning on moving the FTL sources into this space too, but decided it was out of scope for this PR.

@alerque
Member

alerque commented Oct 1, 2024

Do we have an open issue to track untangling el→grc? It will likely be partially handled by BCP-47 stuff, but making sure the hyphenation patterns end up on the right side of the fence when the dust settles should probably be tracked.

Edit: Opened as #2123.

@alerque alerque merged commit a80093d into sile-typesetter:master Oct 1, 2024
20 checks passed
@Omikhleia
Member Author

Omikhleia commented Oct 1, 2024

The sources are no longer in this branch because I'm not sure we should be distributing them

Well typst/hyphen does have its original TeX sources too.
I think we can have them, as long as we explicitly write where they came from + we could have some sentence in READMEs etc. on the special licenses applying to some data files.
(Theoretically, we cannot distribute our own Lua version without a good disclaimer too)

EDIT: But just looking at your changes, it's great this way. Cool, and thanks for the rework and changes, it makes sense too this way.

@alerque
Member

alerque commented Oct 1, 2024

I wasn't worried about the licensing aspect of redistribution, I was more thinking about the pragmatics of tracking lots of sources we don't install, how to record the versions/upstream locations, etc. We may or may not need the couple megs of overhead and history. Again I didn't rule it out, I just wasn't comfortable with the way they were organized. I even thought of just keeping the *.tex sources as the canonical version we track and making the *.lua version a build-time artifact that we install but don't even need to track (since it is programmatically generated anyway and, except for the key ordering issue, is deterministic).

Labels
bug Software bug issue enhancement Software improvement or feature request refactor Code quality improvements
3 participants