Support for Northern Sami language #17
Hi again, thanks for the suggestions! I could derive lemmatization data from the Universal Dependencies treebank. The first corpus doesn't look as good, as its unsupervised annotation could lead to wrong word pairs. As I said before, for Sami I could use data from the Kaikki project, but for now only about 5,000 words are available. This could improve in the future, but it is not that usable for now. |
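As a rough illustration of deriving word/lemma pairs from a UD treebank: CoNLL-U files store one token per line in ten tab-separated columns, with the surface form in column 2 and the lemma in column 3. The sketch below is only an assumption about how such an extraction could look, not the procedure actually used for Simplemma; the function name `conllu_pairs` is hypothetical and the sample rows are invented placeholders rather than real Northern Sami data.

```python
def conllu_pairs(lines):
    """Yield (form, lemma) pairs from an iterable of CoNLL-U lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A regular CoNLL-U token line has exactly ten columns.
        if len(cols) != 10:
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Treat "_" in the lemma column as an unspecified lemma.
        if lemma == "_":
            continue
        yield form, lemma


# Invented placeholder rows, only to show the expected input shape.
sample = [
    "# sent_id = 1",
    "1\tformA\tlemmaA\tNOUN\t_\t_\t0\troot\t_\t_",
    "2-3\tformBC\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tformB\tlemmaB\tVERB\t_\t_\t1\tdep\t_\t_",
    "",
]
pairs = list(conllu_pairs(sample))
```

Pairs collected this way would still need deduplication and a check for forms that map to several lemmas before they are usable as a lookup table.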
Now that I look further at the description, it's unclear to what extent the UD corpus has been manually corrected. There could be mistakes there as well, so SIKOR is probably also usable. I lack the expertise to evaluate these resources on a qualitative level. Do you have any thoughts to share on the quality of the word/lemma pairs in these resources? |
As far as I know, the UD corpus should be manually corrected. I think it was converted to the UD format from another resource, in which the correction had probably already been done. |
I get many more word pairs from all inflected forms in Kaikki than from UD (although the UD forms should be more frequent). I'll try to integrate the data soon. |
It is now added (version The linguistic material I used to build the word pairs looks good but it is untested, so I'll leave the thread open. Feel free to report potential bugs here. |
This is great news! @nikopartanen and @mariguttorm are currently testing the Northern Sámi lemmatization on real-world example texts. |
Thank you @adbar! We made a small test file for Northern Sámi. The accuracy is around 75%, although the text also had some Finnish words and names. The file is here, with manually checked lemmas in the right-hand column: https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21 The text is from our Northern Sámi project announcement, so it may not be perfectly representative, but at least it's ours to share and work with. One additional comment:
The correct lemma would be |
I added here a version that contains the Simplemma predictions in the third column, so it is easier to measure the accuracy and evaluate the current result. https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21 |
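For anyone who wants to reproduce the accuracy figure from a file laid out like this, here is a minimal sketch. It assumes tab-separated rows in the order token, manually checked lemma, predicted lemma; the exact delimiter of the gist is an assumption, and the helper names `read_rows` and `lemma_accuracy` are hypothetical. The sample rows are placeholders, not data from the test file.

```python
def read_rows(text):
    """Split tab-separated text into rows, ignoring blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def lemma_accuracy(rows):
    """Return the share of rows whose predicted lemma equals the gold lemma."""
    total = 0
    correct = 0
    for row in rows:
        # Assumed column order: token, gold lemma, predicted lemma.
        if len(row) < 3:
            continue
        gold, predicted = row[1], row[2]
        total += 1
        if gold == predicted:
            correct += 1
    return correct / total if total else 0.0


# Placeholder rows: three of the four predictions match the gold lemma.
sample = "tokA\tlemA\tlemA\ntokB\tlemB\twrong\ntokC\tlemC\tlemC\ntokD\tlemD\tlemD\n"
score = lemma_accuracy(read_rows(sample))
```

Exact string matching is the strictest comparison; if case differences should not count as errors, both columns could be lowercased before comparing.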
Hi @nikopartanen, thanks for the evaluation! Thanks for the suggestion, I will correct the entries comprising the You could try to chain Northern Sámi and Finnish lemmatization to see if it changes something on your sample: |
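The snippet that accompanied the comment above is not preserved here. As a hedged sketch of what chaining could look like, recent simplemma releases document passing a tuple of language codes to `lemmatize`, in which case the languages are tried in order. The wrapper name `lemmatize_tokens` is hypothetical, and older releases used a different calling convention, so this may not match the code originally posted.

```python
def lemmatize_tokens(tokens, langs=("se", "fi")):
    """Lemmatize tokens, trying Northern Sámi first and Finnish second."""
    # Imported lazily so this module loads even where simplemma is missing.
    # Assumes the lemmatize(token, lang=...) signature documented in the
    # README of recent simplemma releases.
    from simplemma import lemmatize

    return [lemmatize(token, lang=langs) for token in tokens]
```

Calling `lemmatize_tokens(text.split())` on the sample text and comparing the output against the manually checked lemmas would show whether adding Finnish changes the results.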
Hi @nikopartanen & @osma, have you tried the chain described above and did it improve the results? Also: since support has been added, can I close this issue for now? |
I think the current behaviour is about as good as we can get with the current materials. If there are more lemmatized materials somewhere, then training the system with extended data could be done, but the current result is also certainly useful as a part of larger pipelines etc. The issue can be closed now, thank you very much for your work on this topic! |
Thanks again from my part as well. I will close the issue. |
I propose that Simplemma could support the Northern Sami language (ISO 639-1 code `se`). I understood from this discussion that adding a new language would require a corpus of word + lemma pairs. My colleague @nikopartanen found at least these two corpora that could perhaps be used as raw material:
SIKOR North Saami free corpus - this is a relatively large (9M tokens) corpus. From the description:
Another obvious corpus is the Universal Dependency treebank for Northern Sami.
Thoughts on these two corpora? What needs to be done to make this happen?