2020-12-15: Japanese glosario page is not alphabetical order #254

masamiy · 2020-12-15T06:05:06Z

https://carpentries.github.io/glosario/ja/ lists Japanese entries based on the first character of each entry. This means the entries are grouped by their literal characters rather than by the Japanese alphabet (or the English alphabet). The last entry, 'function', should be at the top of the current list if terms are categorised by the Japanese alphabet, because it is read as 'kansuu'.
As there are 46+ characters in the Japanese alphabet, I feel we need some kind of indexing strategy.

baileythegreen · 2020-12-15T10:33:30Z

@masamiy

The order of the entries is determined by a sort function on line 16 of _includes/glossary.html, which operates on individual characters. It may be that for languages such as Japanese we need to find a different solution entirely. The sort function currently being used is a Liquid one, and I very much doubt Liquid has a different one that will sort Japanese correctly. I am familiar with the website infrastructure and the code, but I don't know the Japanese alphabet, so this isn't something I can fix on my own.
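To illustrate why a character-based sort puts 'function' last, here is a minimal sketch. It uses Python's built-in `sorted` as a stand-in for the Liquid sort, on the assumption that both compare strings code point by code point. Apart from 関数 ('function', read 'kansuu'), the sample terms are hypothetical and not taken from the glossary.

```python
# Hypothetical sample terms; only 関数 ("function") is mentioned in the issue.
terms = ["関数", "データフレーム", "ひきすう", "API"]

# sorted() on plain strings compares them code point by code point.
by_code_point = sorted(terms)

# Show the code point of each term's first character to make the order visible.
for term in by_code_point:
    print(term, hex(ord(term[0])))
```

Latin letters, hiragana, katakana, and kanji live in separate, increasing Unicode ranges, so any term that starts with a kanji lands after every kana term regardless of how it is read.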

I can see two options for solving it:

1. We find or write a function in something like Ruby or Python to sort Japanese (and any other language that has this problem), based on an input list of the alphabet, if need be.
2. We move to a different system for storing definitions other than a YAML file so that the sorting can take place at a slightly different step. An example would be an SQLite database which can export its contents, or part of its contents, as a YAML or other config-type file. This involves more changes to the infrastructure, though.
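As a rough sketch of the first option, a sort key can be built from an input list of the alphabet. The function name `make_alphabet_key` and the sample readings are hypothetical; the alphabet string is the 46 basic hiragana in gojūon order. Characters not in the list (voiced kana, small kana, kanji) simply sort after the listed ones here, which a real script would need to handle more carefully.

```python
def make_alphabet_key(alphabet):
    """Return a sort key that orders strings by position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def key(term):
        # Pair each rank with the character itself so unknown characters
        # still have a stable order among themselves.
        return [(rank.get(ch, unknown), ch) for ch in term]

    return key


# The 46 basic hiragana in gojuon order.
GOJUON = (
    "あいうえおかきくけこさしすせそたちつてと"
    "なにぬねのはひふへほまみむめもやゆよらりるれろわをん"
)

# Hypothetical readings, written in hiragana.
readings = ["へんすう", "かんすう", "はいれつ"]
ordered_readings = sorted(readings, key=make_alphabet_key(GOJUON))
print(ordered_readings)
```

The same helper could take a different alphabet string for another language, which is the appeal of keeping the alphabet as input data rather than hard-coding it.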

Perhaps @fmichonneau or @gvwilson will have another idea?

fmichonneau · 2020-12-15T15:01:20Z

It looks like option 1 is going to be the way to go.
From a quick search, I saw mecab being mentioned regularly but that's Japanese-specific and wouldn't work for Arabic, Hebrew, Amharic, etc.
From my limited understanding of this, I think the ICU library would order the characters correctly. In R, it's available through the stringi/stringr packages; in Python, through PyICU.
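For reference, a locale-aware sort with PyICU might look like the sketch below. The helper name `icu_sorted` is hypothetical, and the import is guarded because PyICU is a third-party package that may not be installed. This only shows how a collator is plugged into `sorted`; whether ICU's ordering of kanji matches what Japanese readers expect would still need checking by someone who reads the language.

```python
try:
    import icu  # provided by the third-party PyICU package
except ImportError:
    icu = None


def icu_sorted(terms, locale_code):
    """Sort `terms` with the ICU collation rules for `locale_code`."""
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    collator = icu.Collator.createInstance(icu.Locale(locale_code))
    # getSortKey() returns a bytes key; comparing keys follows the collation order.
    return sorted(terms, key=collator.getSortKey)


if icu is not None:
    print(icu_sorted(["関数", "データ", "かんすう"], "ja"))
```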

baileythegreen · 2020-12-15T15:06:46Z

I think option 1 is certainly easier to implement in the short-term. I can take a stab at writing Python code to do this, though I may need someone to verify the output in those languages.

If I do this, unless someone has an objection, I'll probably try to remove the sort logic from _includes/glossary.html entirely and use one script to do all alphabetising, rather than have it happen in different places based on the language in question.

masamiy · 2020-12-15T23:19:10Z

Hi @baileythegreen @fmichonneau, thank you for your attention and suggestions. New sort logic will definitely help for non-alphabetic languages. I am happy to check the Japanese output. Please let me know if there is anything I can help with.

baileythegreen · 2020-12-15T23:33:36Z

@masamiy It'll probably take me a couple of days to get to it because I have some deadlines coming up, but I'll tag you when I do, unless @fmichonneau beats me to it.

masamiy · 2020-12-16T00:02:38Z

Take your time :)

TomKellyGenetics · 2020-12-16T02:24:20Z

@masamiy I think the issue is a mixture of Romaji, Katakana, and Kanji in the terms defined. It's sorting them correctly (as expected for this).

I see two solutions:

1. Give the terms in Hiragana first and it will sort by them. This could make searching them difficult (do the packages support partial matches?).
2. Write a custom script that sorts differently depending on the language (as proposed above). There should be existing solutions for sorting Japanese characters, but I think it's working as expected now.

Either way, furigana (kanji readings) would need to be supported in order to sort by them, and added for each entry (for option No. 2 this would need a new slot, I think).

You cannot parse furigana from Kanji automatically (although some databases already exist). I think it is easier to specify the intended reading for each entry.
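A minimal sketch of sorting by a per-entry reading follows. The `reading` field is a hypothetical slot, not part of the current glossary format, and the entries are illustrative. Katakana is folded to hiragana so that both scripts sort together; entries without a reading fall back to the term itself.

```python
# Katakana ァ..ヶ sit exactly 0x60 code points above hiragana ぁ..ゖ.
KATAKANA_FIRST, KATAKANA_LAST = ord("ァ"), ord("ヶ")
KANA_OFFSET = ord("ァ") - ord("ぁ")


def to_hiragana(text):
    """Fold katakana to hiragana; leave every other character unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )


def reading_key(entry):
    # Use the hypothetical `reading` slot when present, else the term itself.
    return to_hiragana(entry.get("reading") or entry["term"])


# Illustrative entries; `reading` holds the furigana for kanji terms.
entries = [
    {"term": "データ"},
    {"term": "変数", "reading": "へんすう"},
    {"term": "関数", "reading": "かんすう"},
]
ordered_terms = [e["term"] for e in sorted(entries, key=reading_key)]
print(ordered_terms)
```

With readings supplied by hand, 関数 moves to the front of this small list, which is the behaviour requested at the top of the thread. The key still compares by code point, so voiced and small kana are only roughly in dictionary order.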

TomKellyGenetics · 2020-12-16T03:05:12Z

Regarding the order of the entries, the languages on the homepage may need to be changed as well (this is done manually as I understand it).

Sorry, this may need its own issue. (See #259)

froggleston · 2024-09-18T10:35:04Z

@TomKellyGenetics @masamiy @baileythegreen This has taken a while to address, but please check the output on the new Glosario site to raise any sorting issues that still need addressing!

masamiy added the labels question, bug, and lang: ja · Dec 15, 2020

TomKellyGenetics mentioned this issue Dec 16, 2020

2020-12-16: Order of Languages in homepage #259

Closed

froggleston mentioned this issue Sep 5, 2024

2024-06-24: Alphabetical ordering for Ukrainian and Portuguese languages #724

Closed

froggleston mentioned this issue Sep 17, 2024

Add support for externalised sorting per language #748

Merged

froggleston self-assigned this Sep 18, 2024

froggleston added the labels documentation and enhancement, and removed the labels bug and question · Sep 18, 2024
