Upgrade simplemma dependency and remove function cache #617

Closed
adbar opened this issue Aug 31, 2022 · 7 comments · Fixed by #618

Comments

@adbar

adbar commented Aug 31, 2022

Hi, I just noticed that you wrapped a custom class around simplemma, notably to use a function cache.

Interestingly, I had the same idea which indeed speeds up the process. The cache is included in the versions that you use (0.7.* upwards), so you could delete the following line without any impact on performance:

@functools.lru_cache(maxsize=500000)

In addition, you could use the lang attribute (new in 0.7.0) directly with simplemma.lemmatize().

Best,
Adrien

@osma
Member

osma commented Sep 1, 2022

Hi @adbar!

> Hi, I just noticed that you wrapped a custom class around simplemma, notably to use a function cache.

Yes, we are wrapping Simplemma into a custom class because it has to implement the Annif Analyzer API. We've done the same with other stemming and lemmatizing tools.

> Interestingly, I had the same idea which indeed speeds up the process. The cache is included in the versions that you use (0.7.* upwards), so you could delete the following line without any impact on performance:

Thanks for the tip! Indeed, it doesn't make sense to have two layers of caching; the second layer just uses memory without any additional benefit.
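
To illustrate why the second cache layer adds nothing, here is a minimal sketch. It does not use simplemma itself: `library_lemmatize` is a hypothetical stand-in for a library function that already caches its results, and the toy "lemmatization" rule is only there to make the example runnable.

```python
import functools

# Counts how often the underlying (expensive) function body really runs.
calls = {"inner": 0}


@functools.lru_cache(maxsize=1024)
def library_lemmatize(word: str) -> str:
    """Hypothetical stand-in for a library function with a built-in cache."""
    calls["inner"] += 1
    return word.lower().rstrip("s")  # toy rule, not real lemmatization


@functools.lru_cache(maxsize=1024)
def wrapped_lemmatize(word: str) -> str:
    """Wrapper that adds its own, redundant cache on top."""
    return library_lemmatize(word)


for token in ["Cats", "cats", "Cats", "dogs", "Cats"]:
    wrapped_lemmatize(token)

outer = wrapped_lemmatize.cache_info()
inner = library_lemmatize.cache_info()
print("outer:", outer)
print("inner:", inner)
```

With both layers in place, the inner cache is filled with the same three entries as the outer one but never serves a hit, so one of the two layers only costs memory. Dropping the outer decorator leaves the number of real computations unchanged.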

> In addition, you could use the lang attribute (new in 0.7.0) directly with simplemma.lemmatize().

I think we're already doing that:

return simplemma.lemmatize(word, self.lang)

Here, self.lang is a language code (string).

Thanks for creating Simplemma! It has just the perfect combination of simplicity, speed and lemmatizing quality as well as being highly multilingual. It's an excellent addition to Annif and already has enabled support for many new languages.

You may be interested in this paper that we just published in the Code4Lib Journal. It documents the experiments we did using different lemmatizers and their effect on downstream subject indexing performance. Simplemma did quite well in this comparison and the experiments led to the integration of Simplemma with Annif.

@adbar
Author

adbar commented Sep 1, 2022

Hi @osma, thanks for the kind words and the link! Your article looks very interesting indeed; it shows a useful application of robust lemmatization. Please keep in touch if you spot systematic errors or if you want to add further languages; I'd be glad to help.

You could use the lang= argument explicitly to make the code more robust in case things change in the future:
return simplemma.lemmatize(word, lang=self.lang)
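
A minimal sketch of why the keyword form is more robust. The two functions below are hypothetical and only illustrate a signature change in general; they are not simplemma's actual past or future signatures.

```python
def lemmatize_old(token, lang):
    """Hypothetical earlier signature: lang is the second positional parameter."""
    return f"{token}|lang={lang}"


def lemmatize_new(token, greedy=False, lang=None):
    """Hypothetical later signature: a new parameter now precedes lang."""
    return f"{token}|lang={lang}|greedy={greedy}"


# A positional call still runs, but "fi" is silently bound to greedy.
positional = lemmatize_new("kissat", "fi")

# A keyword call keeps its meaning regardless of parameter order.
keyword = lemmatize_new("kissat", lang="fi")

print(positional)
print(keyword)
```

The positional call does not raise an error, which is what makes this kind of breakage hard to notice; the keyword call keeps passing the language code to the right parameter.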

I see you run language detection in #615. Since you're already using simplemma, you might as well have a look at its vocabulary-based language detector; I just updated the README accordingly!

@osma
Member

osma commented Sep 1, 2022

> Please keep in touch if you spot systematic errors or if you want to add further languages; I'd be glad to help.

We (@NatLibFi) may be interested in adding support for the Northern Sami language (ISO 639-1 code se) some time in the future, as we are currently working on improving our infrastructure to better support minority languages. It would be useful to know what kind of corpora etc. are needed for adding a new language. Maybe (hopefully) the necessary resources already exist in the various language banks and they would just have to be adapted and repurposed for Simplemma.

> You could use the lang= argument explicitly to make the code more robust in case things change in the future:
> return simplemma.lemmatize(word, lang=self.lang)

Ah right. I added this change to PR #618 in this commit: d4abfa4

> I see you run language detection in #615. Since you're already using simplemma, you might as well have a look at its vocabulary-based language detector; I just updated the README accordingly!

Oh, this is extremely interesting! We are currently using pycld3 for language detection, but it appears to no longer be actively maintained so we're looking for alternatives (see #593). One possibility was to try to use Lingua (#615) but it's much slower than pycld3. Do you have any idea how accurate the language detection in Simplemma is compared to other alternatives? The Lingua author has done extensive comparisons between different language detection libraries. It would be good to benchmark your algorithm in the same way and compare results. Not that we need super-high accuracy in Annif, where language detection is currently only used for filtering input text so that sentences that are not in the expected language are dropped.
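
As a rough sketch of the filtering use case described above (dropping sentences that are not in the expected language), the example below uses a toy vocabulary-based score. The word lists, the `share_in_language` helper and the 0.5 threshold are all assumptions made for illustration; they are not Annif's or Simplemma's implementation, and a real detector would rely on full vocabularies.

```python
from typing import Callable, List

# Tiny hypothetical vocabularies, for illustration only.
VOCAB = {
    "en": {"the", "is", "a", "library", "book", "in", "this"},
    "fi": {"on", "ja", "kirja", "tämä", "kirjasto", "hyvä"},
}


def share_in_language(sentence: str, lang: str) -> float:
    """Return the fraction of tokens found in the toy vocabulary for lang."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in VOCAB[lang])
    return known / len(tokens)


def filter_sentences(
    sentences: List[str],
    lang: str,
    threshold: float = 0.5,
    score: Callable[[str, str], float] = share_in_language,
) -> List[str]:
    """Keep only the sentences whose score for lang reaches the threshold."""
    return [s for s in sentences if score(s, lang) >= threshold]


sentences = [
    "This is a book.",
    "Tämä on hyvä kirja.",
    "The library is in Helsinki.",
]
kept_en = filter_sentences(sentences, "en")
kept_fi = filter_sentences(sentences, "fi")
print(kept_en)
print(kept_fi)
```

Because the scoring function is passed in as a parameter, a different detector could be swapped in for comparison without touching the filtering logic.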

@osma osma closed this as completed in #618 Sep 1, 2022
@osma osma modified the milestones: Short term, 0.59 Sep 1, 2022
@adbar
Author

adbar commented Sep 1, 2022

  • Concerning Sami, I could use data from the Kaikki project, but for now only about 5,000 words are available. Would that be enough?
    If you know where to find lemmatization lists (where, for example, each line contains a lemma with its inflected forms), I could use them as well. See for instance adbar/simplemma#1 (Use additional sources for better coverage).
  • In my experience, language detection accuracy varies greatly according to the languages and text types you're interested in. So it's definitely worth running an evaluation before choosing a software solution.
    • Simplemma should be good enough and especially good on noisy text.
    • I've used langid.py ever since it was made available, and I recently released a Python 3 port (py3langid). It works on n-grams instead of whole words, so both approaches are complementary.
    • Fasttext is fast and works well for most languages. You can also look at related work such as whatthelang; luga looks promising.
    • For yet another approach (hunspell + fasttext) see fastspell.
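
For the lemmatization lists mentioned in the first point, here is a minimal sketch of how such a file could be turned into a lookup table. The format is an assumption (one tab-separated line per lemma, with the lemma first and its inflected forms after it), and `load_lemma_list` is a hypothetical helper, not part of simplemma. The sample data is English to avoid inventing Sami forms.

```python
import io
from typing import Dict, TextIO


def load_lemma_list(handle: TextIO) -> Dict[str, str]:
    """Map each word form to its lemma.

    Assumed line format: lemma<TAB>form1<TAB>form2...
    Blank lines are skipped. If a form appears under several lemmas,
    the first lemma seen wins.
    """
    form_to_lemma: Dict[str, str] = {}
    for line in handle:
        fields = line.strip().split("\t")
        if not fields[0]:
            continue  # blank line
        lemma, forms = fields[0], fields[1:]
        form_to_lemma.setdefault(lemma, lemma)
        for form in forms:
            form_to_lemma.setdefault(form, lemma)
    return form_to_lemma


sample = io.StringIO("run\truns\tran\trunning\n\ngo\tgoes\twent\tgone\n")
table = load_lemma_list(sample)
print(table["went"], table["running"], len(table))
```

A real list would also need a policy for ambiguous forms; keeping the first lemma seen is just the simplest choice for a sketch.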

In any case, I'd be curious to read about your experiments should you publish them.

@osma
Member

osma commented Sep 1, 2022

> Concerning Sami, I could use data from the Kaikki project, but for now only about 5,000 words are available. Would that be enough?
> If you know where to find lemmatization lists (where, for example, each line contains a lemma with its inflected forms), I could use them as well. See for instance adbar/simplemma#1

Thanks! I will ask my colleagues about Sami language resources. I suspect that 5000 words is way too little, but I think there are bigger corpora available elsewhere. Now that this issue is closed, and this is off-topic anyway, I think it's better to continue the discussion elsewhere. If/when I have something new to tell, I will open an issue about this under the Simplemma project, OK?

> In my experience, language detection accuracy varies greatly according to the languages and text types you're interested in. So it's definitely worth running an evaluation before choosing a software solution.
>
> • Simplemma should be good enough and especially good on noisy text.
> • I've used langid.py ever since it was made available, and I recently released a Python 3 port (py3langid). It works on n-grams instead of whole words, so both approaches are complementary.
> • Fasttext is fast and works well for most languages. You can also look at related work such as whatthelang; luga looks promising.
> • For yet another approach (hunspell + fasttext) see fastspell.
>
> In any case, I'd be curious to read about your experiments should you publish them.

Thanks, these are extremely useful pointers!

I will continue this in issue #593 where it more properly belongs. I don't expect to be doing very formal experiments (like the lemmatizer tests in the Code4Lib article), but perhaps something along the lines of PR #615 with the same kind of reporting.

@adbar
Author

adbar commented Sep 1, 2022

Yes, feel free to open a new issue in Simplemma's repository.

Any kind of reporting is fine!
I'm still working on simplemma's language detector at the moment; I'll post results once it's ready.

@osma
Member

osma commented Sep 1, 2022

> Yes, feel free to open a new issue in Simplemma's repository.

Done: adbar/simplemma#17
