Re-generate hyphenation files from source and add new languages #2102
Let's check where we are vs. TeX hyphenation patterns:
"el" is ISO 639 code for modern Greek (1453-present). "grc" is ISO 639 code for ancient Greek (prior to 1453). No idea how that relates to your hyphenation files. :)
Indeed. But I don't know the origin of the current "el" patterns in SILE. It's almost the same file as TeX's "grc" (indeed Ancient Greek), with a bunch of differences here and there, but most of the content is identical... And it's totally different from TeX's el-monoton and el-polyton (both Modern Greek)... So I can't say what it is supposed to be ;)
It seems SILE's "el" patterns were added by @simoncozens in c6e8d0b ("Oops, forgot to include these") directly on master in Feb. 2014. The TeX "grc" patterns were updated in May 2016 ("added support for curly beta"), and after checking more carefully, I confirm that this accounts for our differences from it. So my assumption is that Simon added the "grc" patterns of that time as "el", and as you noted @DavidLRowe, this was likely wrong. It should have been kept as "grc", for Ancient Greek (→ added to my first comment above; we could safely add it too under that name). As for "el", we should likely alias it to "el-monoton": it seems to me that Modern Greek has been monotonic in most contexts since the 1982 reform. (We should state it in the documentation too...) Strictly speaking, we should postpone it to 0.16.x, as it could make rendered documents different? Conclusion: I'll move on adding
EDIT: Doh. Except that the boustrophedon package would kill any standard "grc". Oh well, Ancient Greek will have to wait then, for a fix of its own.
Never mind, I see you already have it posted here; I just missed it. I may reorganize where it is, but all the parts seem to be here already.
Well, nearly identical. Some of our patterns had trailing spaces, and sometimes even slight differences (e.g. one pattern, a pattern with a missing diacritic, etc.). Since it's unclear how they were initially generated, these were likely rough conversion artefacts... So this might even be a fix for some languages.
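To make that kind of check repeatable, here is a minimal sketch of how two pattern lists could be compared once trailing whitespace is ignored. It is an illustration only, not part of this PR: the function names and the sample patterns are made up, and it assumes each pattern is available as a plain string.

```python
def normalize(patterns):
    """Strip surrounding whitespace and drop empty entries."""
    return {p.strip() for p in patterns if p.strip()}


def compare(ours, upstream):
    """Return (patterns only in ours, patterns only in upstream), sorted."""
    a, b = normalize(ours), normalize(upstream)
    return sorted(a - b), sorted(b - a)


# Hypothetical sample data: one trailing space, one missing diacritic.
ours = [".ab4i ", "a1b", "e1o"]
upstream = [".ab4i", "a1b", "é1o"]

only_ours, only_upstream = compare(ours, upstream)
print(only_ours)      # ['e1o']
print(only_upstream)  # ['é1o']
```

The trailing space disappears after normalization, so only the real difference (the missing diacritic) is reported.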
The patterns added in 2014 (commit af4617d) lack a block for some unknown reason... and had extra spaces at some places, probably bad conversion artifacts.
…dle sources in other branch
82509ad to 26bb266
I rehashed the TeX pattern transpiler tooling. The sources are no longer in this branch because I'm not sure we should be distributing them, but I kept them in a branch until we figure that bit out. I updated the importer to read them from the other branch (or any branch of your choice): Sadly it still isn't 100% deterministic because Lua does not guarantee the order in which it outputs non-numeric table keys. That should probably get fixed. I'd also like to refactor it in a way that the comments could be imported too, but we have to start somewhere and this is better than keeping our old patterns just to have the TeX code comments too. I also moved everything back into the
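One common way to get reproducible output for keyed tables is to sort the keys before writing them. The sketch below shows the idea in Python; it is not the tooling from this PR (which is Lua), and `lua_quote`, `emit_lua_table` and the sample entries are hypothetical names and data chosen for illustration.

```python
def lua_quote(s):
    """Quote a string as a Lua double-quoted literal (minimal escaping)."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def emit_lua_table(entries):
    """Serialize a mapping as a Lua table with keys in sorted order."""
    lines = ["return {"]
    for key in sorted(entries):  # sorting makes the output reproducible
        lines.append(f"   [{lua_quote(key)}] = {lua_quote(entries[key])},")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Same content, different insertion order: the output is identical.
first = {"table": "ta-ble", "associate": "as-so-ciate"}
second = {"associate": "as-so-ciate", "table": "ta-ble"}
out_first = emit_lua_table(first)
out_second = emit_lua_table(second)
print(out_first == out_second)
```

In Lua itself, the equivalent approach would be to collect the keys into an array, call `table.sort` on it, and iterate over that array instead of relying on `pairs`.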
Do we have an open issue to track untangling el→grc? It will likely be partially handled by BCP-47 stuff, but making sure the hyphenation patterns end up on the right side of the fence when the dust settles should probably be tracked. Edit: Opened as #2123.
Well, typst/hyphen does have its original TeX sources too. EDIT: But just looking at your changes, it's great this way. Cool, and thanks for the rework and changes; it makes sense this way too.
I wasn't worried about the licensing aspect of redistribution; I was more thinking about the pragmatics of tracking lots of sources we don't install, how to record the versions/upstream locations, etc. We may or may not need the couple of megs of overhead and history. Again, I didn't rule it out, I just wasn't comfortable with the way they were organized. I even thought of just keeping the *.tex sources as the canonical version we track and making the *.lua version a build-time artifact that we install but don't even need to track (since it is programmatically generated anyway and, except for the key ordering issue, is deterministic).
The first commit splits the hyphenation patterns out of the main language logic so data and code are separate. I did it manually and checked each file one by one.
The second commit re-generates most of these patterns from the original TeX sources, with a conversion script. I also checked each file manually (having split them previously makes the diffs more readable).
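For readers unfamiliar with the TeX side, the following is a rough sketch of what a naive conversion step might look like. It is not the script from this PR: it assumes the source uses simple, non-nested `\patterns{...}` and `\hyphenation{...}` blocks with `%` comments, and all function names and the sample input are hypothetical.

```python
import re


def strip_comments(tex):
    """Remove TeX comments (an unescaped % up to the end of the line)."""
    return "\n".join(re.sub(r"(?<!\\)%.*", "", line) for line in tex.splitlines())


def extract_block(tex, macro):
    """Return the whitespace-separated items of the first \\macro{...} block."""
    match = re.search(r"\\" + macro + r"\s*\{([^}]*)\}", tex)
    return match.group(1).split() if match else []


def parse_tex_patterns(tex):
    """Split a TeX pattern file into patterns and hyphenation exceptions."""
    body = strip_comments(tex)
    return {
        "patterns": extract_block(body, "patterns"),
        "exceptions": extract_block(body, "hyphenation"),
    }


# Made-up sample input, not real patterns from any language.
sample = r"""
% Sample only
\patterns{
.ab4i   % trailing comment
a1b
}
\hyphenation{
ta-ble
}
"""

result = parse_tex_patterns(sample)
print(result)
```

Because items are split on whitespace, stray trailing spaces cannot leak into the output. Real pattern files may contain nested braces or additional macros, which this sketch does not attempt to handle.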
The (very naive) conversion script is included, as well as the original TeX sources:
Subsequent commits are:
Notes: