Support for Northern Sami language #17

osma · 2022-09-01T14:20:46Z

I propose that Simplemma could support the Northern Sami language (ISO 639-1 code se). I understood from this discussion that adding a new language would require a corpus of word + lemma pairs. My colleague @nikopartanen found at least these two corpora that could perhaps be used as raw material:

SIKOR North Saami free corpus - this is a relatively large (9M tokens) corpus. From the description:

The corpus has been automatically processed and linguistically analyzed with the Giellatekno/Divvun tools. Therefore, it may contain wrong annotations.

Another obvious corpus is the Universal Dependencies treebank for Northern Sami.

Thoughts on these two corpora? What needs to be done to make this happen?

adbar · 2022-09-01T14:34:47Z

Hi again, thanks for the suggestions!

I could derive lemmatization data from the Universal Dependencies treebank. The first corpus doesn't look as good, since its automatic annotation could lead to wrong word pairs.

As I said before, concerning Sami I could use data from the Kaikki project, but for now only about 5,000 words are available. That may improve in the future, but it is not very usable for now.
Ideally I could combine the two resources and see if it leads anywhere.
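For illustration, deriving word + lemma pairs from a UD treebank could look roughly like the sketch below. It relies only on the standard CoNLL-U layout (ten tab-separated columns, with the form in column 2 and the lemma in column 3). The function name `conllu_pairs` and the sample line are made up for this example, and this is not necessarily how Simplemma builds its data.

```python
from collections import Counter


def conllu_pairs(lines):
    """Count (form, lemma) pairs found in CoNLL-U formatted lines."""
    pairs = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        if "-" in token_id or "." in token_id:
            continue
        # "_" marks a lemma that was left unspecified
        if lemma == "_":
            continue
        pairs[(form, lemma)] += 1
    return pairs


# Constructed sample line for illustration, not copied from the treebank
sample = [
    "# text = buorebut",
    "1\tbuorebut\tbures\tADV\t_\t_\t0\troot\t_\t_",
]
print(conllu_pairs(sample))
```

Counting the pairs keeps frequency information, which could help when the same form maps to several lemmas.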

adbar · 2022-09-01T14:39:01Z

Now that I look further at the description, it's unclear to what extent the UD corpus has been manually corrected. There could be mistakes there as well, so SIKOR is probably also usable.

I lack the expertise to evaluate these resources on a qualitative level. Do you have any thoughts to share on the quality of the word/lemma pairs in these resources?

nikopartanen · 2022-09-01T14:47:37Z

I can comment that as far as I know the UD corpus should be manually corrected. I think it was converted to the UD format from something else, in which the correction was probably already done.

adbar · 2022-09-01T16:03:08Z

I get many more word pairs from all inflected forms in Kaikki than from UD (although the UD forms should be more frequent). I'll try to integrate the data soon.

adbar · 2022-09-05T14:13:34Z

It is now added (version 0.8.2, language code se). I used the opportunity to add a few other languages as well ✔️

The linguistic material I used to build the word pairs looks good but it is untested, so I'll leave the thread open. Feel free to report potential bugs here.

osma · 2022-09-07T09:06:55Z

This is great news! @nikopartanen and @mariguttorm are currently testing the Northern Sámi lemmatization on real-world example texts.

nikopartanen · 2022-09-07T09:11:31Z

Thank you @adbar! We made a small test file for Northern Sámi. The accuracy is around 75%, although the text also had some Finnish words and names. The file is here, with manually checked lemmas in the right-hand column:

https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21

The text is from our Northern Sámi project announcement, so it may not be perfectly representative, but at least it's ours to share and work with.
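A minimal sketch of how such an accuracy figure could be computed, assuming the test file is tab-separated with the token in the first column and the checked lemma in the second (the actual layout of the gist may differ). The names `read_gold` and `accuracy` are hypothetical, and the lemmatizer is passed in as a plain function so the sketch does not depend on any particular library version.

```python
import csv
import io


def read_gold(tsv_text):
    """Parse tab-separated text into (token, gold_lemma) pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows with fewer than two columns (e.g. blank lines) are ignored
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def accuracy(pairs, lemmatize):
    """Share of tokens whose predicted lemma equals the gold lemma."""
    if not pairs:
        return 0.0
    correct = sum(1 for token, gold in pairs if lemmatize(token) == gold)
    return correct / len(pairs)
```

With Simplemma installed, the second argument could be something like `lambda t: simplemma.lemmatize(t, lang="se")`.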

One additional comment:

lemmatize("buorebut", lang=("se"))

> 'būres'

The correct lemma would be bures; ū is only used in dictionaries and similar contexts to mark the long pronunciation of u. It shouldn't appear in lemmas in this context, but it probably occurs in the training data.
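One possible way to clean such entries, sketched here as an assumption rather than as the fix actually applied to the training data: decompose each lemma, drop the combining macron (U+0304), and recompose. Only the macron is removed, so Northern Sámi letters such as č, š, ž and á are left untouched. The function name `strip_macrons` is hypothetical.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text):
    """Remove macrons while keeping every other diacritic intact."""
    # NFD splits "ū" into "u" followed by the combining macron
    decomposed = unicodedata.normalize("NFD", text)
    cleaned = decomposed.replace(COMBINING_MACRON, "")
    # NFC recomposes the remaining letters, e.g. "c" + caron back into "č"
    return unicodedata.normalize("NFC", cleaned)


print(strip_macrons("būres"))  # bures
```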

nikopartanen · 2022-09-07T09:39:37Z

I added here a version that contains the Simplemma predictions in the third row, so it is easier to measure the accuracy and evaluate the current result.

https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21

adbar · 2022-09-07T11:28:25Z

Hi @nikopartanen, thanks for the evaluation!
My impression is that the lemmatizer mostly behaves as expected: it rarely introduces mistakes (i.e. wrong lemmata), and nearly all errors are tokens that do not get lemmatized and stay as they are. Bearing that in mind, and considering the small size of the training data, I would say the accuracy isn't bad at all.
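The distinction between wrong lemmata and tokens left as they are can be checked mechanically. The sketch below assumes rows of (token, checked lemma, predicted lemma), which is a guess at how the test file could be loaded; the function name and category labels are made up for illustration.

```python
from collections import Counter


def error_breakdown(rows):
    """Classify each (token, gold, predicted) row by outcome."""
    counts = Counter()
    for token, gold, predicted in rows:
        if predicted == gold:
            counts["correct"] += 1
        elif predicted == token:
            # The lemmatizer returned the token as is, but a different lemma was expected
            counts["unchanged"] += 1
        else:
            # The lemmatizer produced a new form that is not the expected lemma
            counts["wrong_lemma"] += 1
    return counts
```

A high share of "unchanged" relative to "wrong_lemma" would point to missing coverage rather than bad entries.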

Thanks for the suggestion, I will correct the entries comprising the ū symbol in the training data.

You could try to chain Northern Sámi and Finnish lemmatization to see if it changes something on your sample: lang=("se", "fi").
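To see what the chain changes on a sample, the two settings can be run side by side. In the sketch below the lemmatizer is passed in as a function with the same calling convention as `lemmatize(token, lang=...)`; the helper names `compare_chains` and `changed_tokens` are hypothetical.

```python
def compare_chains(tokens, lemmatize, chains=(("se",), ("se", "fi"))):
    """Lemmatize the same tokens once per language chain."""
    return {
        chain: [lemmatize(token, lang=chain) for token in tokens]
        for chain in chains
    }


def changed_tokens(tokens, results, base=("se",), extended=("se", "fi")):
    """List (token, base_lemma, extended_lemma) where the two chains disagree."""
    return [
        (token, a, b)
        for token, a, b in zip(tokens, results[base], results[extended])
        if a != b
    ]


# Usage, assuming Simplemma 0.8.2 or later is installed:
# from simplemma import lemmatize
# results = compare_chains(tokens, lemmatize)
# print(changed_tokens(tokens, results))
```

The tokens that change are the ones picked up by the Finnish data, so they are worth checking by hand for Sámi words that only look Finnish.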

adbar · 2022-10-05T10:39:09Z

Hi @nikopartanen & @osma, have you tried the chain described above and did it improve the results?

Also: since support has been added, can I close this issue for now?

nikopartanen · 2022-10-28T12:30:24Z

I think the current behaviour is about as good as we can get with the current materials. If there are more lemmatized materials somewhere, then training the system with extended data could be done, but the current result is also certainly useful as a part of larger pipelines etc. The issue can be closed now, thank you very much for your work on this topic!

osma · 2022-10-28T12:31:59Z

Thanks again from my part as well. I will close the issue.

osma mentioned this issue on Sep 1, 2022: Upgrade simplemma dependency and remove function cache NatLibFi/Annif#617 (closed)

adbar added the enhancement label on Sep 1, 2022

osma closed this as completed on Oct 28, 2022
