Upgrade simplemma dependency and remove function cache #617
Hi @adbar!
Yes, we are wrapping Simplemma into a custom class because it has to implement the Annif Analyzer API. We've done the same with other stemming and lemmatizing tools.
Thanks for the tip! Indeed, it doesn't make sense to have two layers of caching, it will just use memory without any additional benefit.
I think we're already doing that: Annif/annif/analyzer/simplemma.py Line 17 in aa50441
Thanks for creating Simplemma! It has just the right combination of simplicity, speed and lemmatizing quality, and it is highly multilingual. It's an excellent addition to Annif and has already enabled support for many new languages. You may be interested in this paper that we just published in the Code4Lib Journal. It documents the experiments we did using different lemmatizers and their effect on downstream subject indexing performance. Simplemma did quite well in this comparison, and the experiments led to the integration of Simplemma with Annif.
Hi @osma, thanks for the kind words and the link! Your article looks very interesting indeed; it shows a useful application of robust lemmatization. Please keep in touch if you spot systematic errors or if you want to add further languages, I'd be glad to help. You could use the
I see you run language detection in #615. Since you're already using simplemma, you might as well have a look at its vocabulary-based language detector. I just updated the README accordingly!
We (@NatLibFi) may be interested in adding support for Northern Sami language (ISO 639-1 code
Ah right. I added this change to PR #618 in this commit: d4abfa4
Oh, this is extremely interesting! We are currently using pycld3 for language detection, but it appears to no longer be actively maintained, so we're looking for alternatives (see #593). One possibility was to try to use Lingua (#615), but it's much slower than pycld3. Do you have any idea how accurate the language detection in Simplemma is compared to other alternatives? The Lingua author has done extensive comparisons between different language detection libraries. It would be good to benchmark your algorithm in the same way and compare results. Not that we need super-high accuracy in Annif, where language detection is currently only used for filtering input text so that sentences that are not in the expected language are dropped.
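For readers following along, the filtering step described above (drop sentences that are not in the expected language) could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_detect`, `VOCAB` and `filter_by_language` are hypothetical names, the tiny word lists are made up, and none of this is the actual Annif, pycld3, Lingua or Simplemma code. A real setup would pass a proper detector in place of `toy_detect`.

```python
from typing import Callable, Dict, List, Set

# Tiny made-up vocabularies (assumption); a real vocabulary-based
# detector would rely on far larger word lists.
VOCAB: Dict[str, Set[str]] = {
    "en": {"the", "is", "and", "a", "of", "cat"},
    "fi": {"on", "ja", "kissa", "ei", "se"},
}


def toy_detect(sentence: str) -> str:
    """Hypothetical detector: pick the language whose word list matches most tokens.

    Ties (including zero matches everywhere) fall back to the first
    language in VOCAB, which is a simplification of this sketch.
    """
    tokens = sentence.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in VOCAB.items()}
    return max(scores, key=scores.get)


def filter_by_language(
    sentences: List[str],
    expected: str,
    detect: Callable[[str], str] = toy_detect,
) -> List[str]:
    """Keep only the sentences that the detector assigns to the expected language."""
    return [s for s in sentences if detect(s) == expected]


sentences = ["The cat is black", "Kissa on musta", "A book of poems"]
kept = filter_by_language(sentences, "en")
print(kept)
```

Because the detector is passed in as a plain callable, different libraries could be swapped in and compared for speed and accuracy without touching the filtering logic.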
In any case, I'd be curious to read about your experiments should you publish them.
Thanks! I will ask my colleagues about Sami language resources. I suspect that 5000 words is way too little, but I think there are bigger corpora available elsewhere. Now that this issue is closed, and this is off-topic anyway, I think it's better to continue the discussion elsewhere. If/when I have something new to tell, I will open an issue about this under the Simplemma project, OK?
Thanks, these are extremely useful pointers! I will continue this in issue #593 where it more properly belongs. I don't expect to be doing very formal experiments (like the lemmatizer tests in the Code4Lib article), but perhaps something along the lines of PR #615 with the same kind of reporting.
Yes, feel free to open a new issue in Simplemma's repository. Any kind of reporting is fine!
Done: adbar/simplemma#17
Hi, I just noticed that you wrapped a custom class around simplemma, notably to use a function cache.
Interestingly, I had the same idea which indeed speeds up the process. The cache is included in the versions that you use (0.7.* upwards), so you could delete the following line without any impact on performance:
Annif/annif/analyzer/simplemma.py Line 15 in aa50441
In addition, you could use the lang attribute (new in 0.7.0) directly with simplemma.lemmatize().
Best,
Adrien
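To illustrate why a second cache layer adds nothing here, the following is a minimal sketch using only the standard library. `cached_lemmatize` is a hypothetical stand-in for a library function that already caches internally, and `wrapped_lemmatize` stands in for a wrapper that adds its own cache on top. Neither is the real Simplemma or Annif code, and the "lemmatization" rule is a toy.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying lookup actually runs


@lru_cache(maxsize=None)
def cached_lemmatize(token: str, lang: str = "en") -> str:
    """Hypothetical stand-in for a lemmatizer that caches its own results."""
    global calls
    calls += 1
    return token.lower().rstrip("s")  # toy rule, not real lemmatization


@lru_cache(maxsize=None)
def wrapped_lemmatize(token: str) -> str:
    """Redundant outer cache, like a wrapper class caching the same call again."""
    return cached_lemmatize(token, "en")


for word in ["Cats", "dogs", "Cats", "dogs"]:
    wrapped_lemmatize(word)

outer = wrapped_lemmatize.cache_info()
inner = cached_lemmatize.cache_info()
print("underlying computations:", calls)
print("outer cache:", outer)
print("inner cache:", inner)
```

In this sketch both caches end up holding the same entries, and the repeated lookups are answered by the outer cache before the inner one is ever consulted. Removing the outer layer would leave the number of underlying computations unchanged while storing each result only once.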