Upgrade simplemma dependency and remove function cache #617

adbar · 2022-08-31T15:17:29Z

Hi, I just noticed that you wrapped a custom class around simplemma, notably to use a function cache.

Interestingly, I had the same idea which indeed speeds up the process. The cache is included in the versions that you use (0.7.* upwards), so you could delete the following line without any impact on performance:

Annif/annif/analyzer/simplemma.py

Line 15 in aa50441

@functools.lru_cache(maxsize=500000)

In addition, you could use the lang attribute (new in 0.7.0) directly with simplemma.lemmatize().

Best,
Adrien

osma · 2022-09-01T07:23:51Z

Hi @adbar!

Hi, I just noticed that you wrapped a custom class around simplemma, notably to use a function cache.

Yes, we are wrapping Simplemma into a custom class because it has to implement the Annif Analyzer API. We've done the same with other stemming and lemmatizing tools.

Interestingly, I had the same idea which indeed speeds up the process. The cache is included in the versions that you use (0.7.* upwards), so you could delete the following line without any impact on performance:

Thanks for the tip! Indeed, it doesn't make sense to have two layers of caching; the second layer will just use memory without any additional benefit.
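To illustrate why the second layer is redundant, here is a minimal sketch. It does not use simplemma itself: `library_lemmatize` is a hypothetical stand-in for a library function that already caches its results, and `wrapped_lemmatize` plays the role of a wrapper that adds its own `functools.lru_cache`. The cache sizes and the toy suffix rule are assumptions made for the example.

```python
import functools

calls = 0  # counts how often the real (uncached) work is done


@functools.lru_cache(maxsize=1024)
def library_lemmatize(token: str, lang: str) -> str:
    """Hypothetical stand-in for a library function with a built-in cache."""
    global calls
    calls += 1
    return token.lower().rstrip("s")  # toy rule, not a real lemmatizer


@functools.lru_cache(maxsize=1024)
def wrapped_lemmatize(token: str, lang: str) -> str:
    """Outer cache, analogous to decorating the wrapper a second time."""
    return library_lemmatize(token, lang)


for word in ["cats", "dogs", "cats", "cats"]:
    wrapped_lemmatize(word, "en")

outer = wrapped_lemmatize.cache_info()
inner = library_lemmatize.cache_info()
print("real work done:", calls)
print("outer cache:", outer)
print("inner cache:", inner)
```

In this sketch both caches end up holding the same entries, and the inner cache never gets a hit because the outer one answers every repeated lookup first. Dropping the outer decorator would leave the amount of real work unchanged while storing each entry only once.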

In addition, you could use the lang attribute (new in 0.7.0) directly with simplemma.lemmatize().

I think we're already doing that:

Annif/annif/analyzer/simplemma.py

Line 17 in aa50441

return simplemma.lemmatize(word, self.lang)

Here, self.lang is a language code (string).

Thanks for creating Simplemma! It has just the right combination of simplicity, speed and lemmatizing quality, as well as being highly multilingual. It's an excellent addition to Annif and has already enabled support for many new languages.

You may be interested in this paper that we just published in the Code4Lib Journal. It documents the experiments we did using different lemmatizers and their effect on downstream subject indexing performance. Simplemma did quite well in this comparison and the experiments led to the integration of Simplemma with Annif.

adbar · 2022-09-01T12:00:09Z

Hi @osma, thanks for the kind words and the link! Your article looks very interesting indeed; it shows a useful application of robust lemmatization. Please keep in touch if you spot systematic errors or if you want to add further languages. I'd be glad to help.

You could use the lang= argument explicitly to make the code more robust in case things change in the future:
return simplemma.lemmatize(word, lang=self.lang)
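
A minimal sketch of why the keyword form is more robust. The class and the stand-in function below are hypothetical: `LemmatizingAnalyzer` is not Annif's actual Analyzer class, and `fake_lemmatize` is not simplemma's signature. It only simulates a function whose positional parameter order differs from what the caller expects.

```python
from typing import Callable


class LemmatizingAnalyzer:
    """Hypothetical wrapper around an injected lemmatizer function."""

    def __init__(self, lang: str, lemmatize: Callable[..., str]) -> None:
        self.lang = lang
        self._lemmatize = lemmatize

    def normalize_word(self, word: str) -> str:
        # Passing lang by keyword keeps this call correct even if the
        # positional order of the underlying function's parameters changes.
        return self._lemmatize(word, lang=self.lang)


def fake_lemmatize(token: str, greedy: bool = False, lang: str = "en") -> str:
    """Stand-in whose second positional parameter is NOT the language."""
    return f"{token.lower()}|{lang}|{greedy}"


analyzer = LemmatizingAnalyzer("fi", fake_lemmatize)
result = analyzer.normalize_word("Talo")
print(result)                        # keyword call: language lands in `lang`
print(fake_lemmatize("Talo", "fi"))  # positional call: "fi" lands in `greedy`
```

With the positional call, the language code silently binds to the wrong parameter, which is the kind of breakage the explicit `lang=` guards against.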

I see you run language detection in #615. Since you're already using simplemma, you might as well have a look at its vocabulary-based language detector; I just updated the README accordingly!

osma · 2022-09-01T13:13:31Z

Please keep in touch if you spot systematic errors or if you want to add further languages. I'd be glad to help.

We (@NatLibFi) may be interested in adding support for Northern Sami language (ISO 639-1 code se) some time in the future, as we are currently working on improving our infrastructure to better support minority languages. It would be useful to know what kind of corpora etc. are needed for adding a new language. Maybe (hopefully) the necessary resources already exist in the various language banks and they would just have to be adapted and repurposed for Simplemma.

You could use the lang= argument explicitly to make the code more robust in case things change in the future:
return simplemma.lemmatize(word, lang=self.lang)

Ah right. I added this change to PR #618 in this commit: d4abfa4

I see you run language detection in #615. Since you're already using simplemma, you might as well have a look at its vocabulary-based language detector; I just updated the README accordingly!

Oh, this is extremely interesting! We are currently using pycld3 for language detection, but it appears to no longer be actively maintained so we're looking for alternatives (see #593). One possibility was to try to use Lingua (#615) but it's much slower than pycld3. Do you have any idea how accurate the language detection in Simplemma is compared to other alternatives? The Lingua author has done extensive comparisons between different language detection libraries. It would be good to benchmark your algorithm in the same way and compare results. Not that we need super-high accuracy in Annif, where language detection is currently only used for filtering input text so that sentences that are not in the expected language are dropped.
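
As a rough sketch of the filtering use case described above, the following drops sentences that are not in the expected language. The detector here is a toy vocabulary lookup written for this example; it is not pycld3, Lingua, or Simplemma's detector, and the word lists, function names, and the choice to drop sentences with no recognizable language are all assumptions.

```python
import re
from typing import Dict, List, Optional, Set

# Tiny illustrative vocabularies; a real detector would use far larger ones.
VOCAB: Dict[str, Set[str]] = {
    "en": {"the", "is", "a", "this", "and", "library", "book"},
    "fi": {"on", "ja", "tämä", "kirja", "kirjasto", "se", "ei"},
}


def detect_language(sentence: str) -> Optional[str]:
    """Return the language whose vocabulary matches the most tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return None
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in VOCAB.items()
    }
    best = max(scores, key=lambda lang: scores[lang])
    return best if scores[best] > 0 else None


def filter_sentences(sentences: List[str], expected: str) -> List[str]:
    """Keep only sentences detected as the expected language."""
    return [s for s in sentences if detect_language(s) == expected]


sentences = [
    "This is a book.",
    "Tämä on kirja.",
    "The library is open.",
    "12345",
]
kept = filter_sentences(sentences, "en")
print(kept)
```

Because the filter only needs a yes/no decision per sentence, a detector with moderate accuracy may be sufficient, which is consistent with the point that super-high accuracy is not required here.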

adbar · 2022-09-01T13:49:13Z

Concerning Sami, I could use data from the Kaikki project, but for now there are only about 5,000 words available. Would that be enough?
If you know where to find lemmatization lists (where, for example, each line contains a lemma with its inflected forms), I could use them as well. See for instance "Use additional sources for better coverage" (adbar/simplemma#1).
In my experience, language detection accuracy varies greatly according to the languages and text types you're interested in. So it's definitely worth running an evaluation before choosing a software solution.
- Simplemma should be good enough and especially good on noisy text.
- I've used langid.py ever since it was made available, and I recently released a Python 3 port (py3langid). It works on n-grams instead of whole words, so both approaches are complementary.
- Fasttext is fast and works well for most languages. You can look for related work: whatthelang, for example, and luga looks promising.
- For yet another approach (hunspell + fasttext) see fastspell.

In any case, I'd be curious to read about your experiments should you publish them.
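
A small evaluation of the kind suggested above could be sketched as follows. Everything here is illustrative: the two detectors are toy functions (one whole-word, one character-based) standing in for real libraries, and the labeled samples are made up for the example, so the resulting numbers say nothing about any actual tool.

```python
from typing import Callable, Dict, List, Tuple

Detector = Callable[[str], str]


def accuracy(detector: Detector, samples: List[Tuple[str, str]]) -> float:
    """Share of samples for which the detector returns the gold label."""
    correct = sum(detector(text) == label for text, label in samples)
    return correct / len(samples)


def word_detector(text: str) -> str:
    """Toy whole-word approach: look for a few Finnish function words."""
    finnish_words = {"ja", "on", "ei"}
    return "fi" if any(w in finnish_words for w in text.lower().split()) else "en"


def char_detector(text: str) -> str:
    """Toy character-based approach: look for letters common in Finnish."""
    return "fi" if any(c in "äö" for c in text.lower()) else "en"


# Hypothetical labeled test set; use text types close to your real input.
SAMPLES: List[Tuple[str, str]] = [
    ("kissa ja koira", "fi"),
    ("hyvää päivää", "fi"),
    ("cats and dogs", "en"),
    ("talo on iso", "fi"),
]

detectors: Dict[str, Detector] = {"word": word_detector, "char": char_detector}
results = {name: accuracy(fn, SAMPLES) for name, fn in detectors.items()}
print(results)
```

The two toy detectors fail on different samples, which mirrors the remark that word-based and n-gram-based approaches are complementary. Swapping in real libraries only requires wrapping each one in a function that takes a text and returns a language code.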

osma · 2022-09-01T14:00:10Z

Concerning Sami, I could use data from the Kaikki project, but for now there are only about 5,000 words available. Would that be enough?
If you know where to find lemmatization lists (where, for example, each line contains a lemma with its inflected forms), I could use them as well. See for instance adbar/simplemma#1

Thanks! I will ask my colleagues about Sami language resources. I suspect that 5,000 words is far too few, but I think there are bigger corpora available elsewhere. Now that this issue is closed, and this is off-topic anyway, I think it's better to continue the discussion elsewhere. If/when I have something new to tell, I will open an issue about this under the Simplemma project, OK?

In my experience, language detection accuracy varies greatly according to the languages and text types you're interested in. So it's definitely worth running an evaluation before choosing a software solution.

Simplemma should be good enough and especially good on noisy text.

I've used langid.py ever since it was made available, and I recently released a Python 3 port (py3langid). It works on n-grams instead of whole words, so both approaches are complementary.

Fasttext is fast and works well for most languages. You can look for related work: whatthelang, for example, and luga looks promising.

For yet another approach (hunspell + fasttext) see fastspell.

In any case, I'd be curious to read about your experiments should you publish them.

Thanks, these are extremely useful pointers!

I will continue this in issue #593 where it more properly belongs. I don't expect to be doing very formal experiments (like the lemmatizer tests in the Code4Lib article), but perhaps something along the lines of PR #615 with the same kind of reporting.

adbar · 2022-09-01T14:18:52Z

Yes, feel free to open a new issue in Simplemma's repository.

Any kind of reporting is fine!
I'm still working on simplemma's language detector at the moment; I'll post results once it's ready.

osma · 2022-09-01T14:21:10Z

Yes, feel free to open a new issue in Simplemma's repository.

Done: adbar/simplemma#17

osma added the enhancement label Sep 1, 2022

osma added this to the Short term milestone Sep 1, 2022

osma added a commit that referenced this issue Sep 1, 2022

upgrade to simplemma 0.8 and disable unnecessary cache. Fixes #617

da70c21

osma mentioned this issue Sep 1, 2022

upgrade to simplemma 0.8 and disable unnecessary cache #618

Merged

osma closed this as completed in #618 Sep 1, 2022

osma modified the milestones: Short term, 0.59 Sep 1, 2022

osma mentioned this issue Sep 1, 2022

Replace pycld3 dependency? #593

Closed

osma mentioned this issue Sep 1, 2022

Support for Northern Sami language adbar/simplemma#17

Closed
