Replace pycld3 dependency? #593
Comments
It should be noted that Lingua is a fairly new library and so has a very short track record, with only two releases so far.
There is an issue asking about Python 3.10 support for pycld3: bsolomon1124/pycld3#31
As pointed out by @adulau in this comment, Lingua can use huge amounts of memory. I tested it in the default lazy loading configuration, and detecting the language of the example sentence …
Hello, I'm the author of Lingua. I've managed to reduce the memory consumption of the library. All language models together now take just around 800 MB in memory. Perhaps you want to re-evaluate Lingua for this project once again. If you have any questions, feel free to ask here or join this discussion that @osma has opened.
Thanks @pemistahl, that is excellent news! We will take a new look at Lingua.
@osma I have just released Lingua 1.1.0. In high accuracy mode, memory consumption is now at 800 MB. In low accuracy mode, it's just 60 MB.
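For anyone evaluating this, selecting between the two modes looks roughly like the sketch below (a minimal example against the lingua-py builder API as documented in its README; the memory figures in the comments are the ones quoted above):

```python
# Minimal sketch of selecting Lingua's accuracy modes via the lingua-py API.
from lingua import Language, LanguageDetectorBuilder

# High accuracy mode is the default (~800 MB with all languages loaded).
detector_high = LanguageDetectorBuilder.from_all_languages().build()

# Low accuracy mode trades some accuracy for memory (~60 MB).
detector_low = (
    LanguageDetectorBuilder.from_all_languages()
    .with_low_accuracy_mode()
    .build()
)

assert detector_high.detect_language_of("Tämä on suomenkielinen lause.") == Language.FINNISH
```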
@pemistahl Whoa, that's quite an improvement!
Yes, that's because the models are now stored in NumPy arrays instead of dictionaries. Querying the arrays is slower than querying dictionaries; that's the downside. But I still use a dictionary as a cache for already looked-up ngrams to speed the process up again.
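To illustrate the pattern being described (this is only an illustrative sketch, not Lingua's actual internals): the model lives in compact sorted arrays, and a plain dict memoizes ngrams that have already been looked up.

```python
# Illustrative sketch only, not Lingua's actual code: compact sorted NumPy
# arrays hold the model, and a dict caches ngrams already looked up.
import numpy as np

class NgramModel:
    def __init__(self, keys: np.ndarray, log_probs: np.ndarray):
        self._keys = keys            # sorted array of integer-coded ngrams
        self._log_probs = log_probs  # parallel array of log-probabilities
        self._cache = {}             # ngram -> log-probability

    def log_prob(self, ngram: str) -> float:
        if ngram in self._cache:                     # fast path: dict hit
            return self._cache[ngram]
        key = hash(ngram) & 0xFFFFFFFF               # integer-code the ngram
        idx = int(np.searchsorted(self._keys, key))  # binary search, O(log n)
        found = idx < len(self._keys) and self._keys[idx] == key
        prob = float(self._log_probs[idx]) if found else float("-inf")
        self._cache[ngram] = prob                    # memoize for next time
        return prob
```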
FYI: there was a little bug in version 1.1.0 that caused wrong probabilities to be returned for certain ngrams. I've just fixed that, so please use version 1.1.1 for your tests from now on. Thank you.
I did some testing of Lingua in draft PR #615; you may want to check that out, @pemistahl.
@adbar suggested these other language detection approaches in #617 (comment):
We could take a look at these and compare how well they work, similar to the Lingua experiments in PR #615, but testing on a different data set (e.g. the Finnish-language jyu-theses) where filtering by language actually improves results.
Hi, just a quick evaluation on my side:
This is a quick-and-dirty approach that leaves many questions open! I'm open to discussion and to wider benchmarks.
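For context, a comparison like this usually boils down to a small harness along these lines (a hypothetical sketch; the `detect_*` callables are placeholders standing in for the real libraries, not actual APIs):

```python
# Hypothetical timing/accuracy harness; detect_* wrappers are placeholders.
import time

def benchmark(name, detect, samples):
    start = time.perf_counter()
    correct = sum(1 for text, lang in samples if detect(text) == lang)
    elapsed = time.perf_counter() - start
    print(f"{name}: {correct / len(samples):.1%} correct in {elapsed:.2f}s")

samples = [
    ("Tämä on suomenkielinen lause.", "fi"),
    ("This is an English sentence.", "en"),
]
# benchmark("pycld3", detect_pycld3, samples)
# benchmark("lingua", detect_lingua, samples)
# benchmark("simplemma", detect_simplemma, samples)
```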
@adbar Thank you very much for the benchmark. Do you have some statistics on memory usage too? I remember some libs were pretty fast but had significant memory usage compared to some others like …
I have now redone the benchmarks described in #615 (comment) with some changes. This time I used the parts of the Finto AI data set with Finnish-language documents, and YSO Filolaos as the vocabulary. Again I used two project configurations with two backend algorithms, MLLM and Omikuji Parabel. For training MLLM, I used the … I compared current master (which uses pycld3) to the PR #615 branch, which uses Lingua.
Again, for the baseline case with no filter I used …
Evaluation results (higher is better):
Observations:
Some preliminary conclusions:
Reported the huge memory usage in Simplemma as adbar/simplemma#19
@osma My library is slower because it is written in pure Python. pycld3 is written in C++ and simplemma uses … You should also add Lingua's high accuracy mode to this comparison, because this is what makes the library superior to most other language detection libraries. Memory consumption and running time will be higher, but accuracy should be much better. It is kind of unfair to leave out the high accuracy mode and then state that Lingua gives the least benefit in terms of quality. I've just released Lingua 1.1.3, which improves performance by roughly 30% compared to 1.1.2. So maybe you want to update your evaluation again.
I understand. Simplemma uses a different, vocabulary-based approach for language detection, though, rather than the n-grams used by most other language detectors, including Lingua.
I apologize for the harsh wording. I was focused on the downstream results: how the language detection, when applied as a filter for training and evaluation data, affects the quality of automated subject indexing. This may or may not correlate with quality benchmarks that focus purely on the accuracy of language detection. It is entirely possible that even a perfect language detector with 100% accuracy would achieve a low score on this downstream benchmark, because there are so many confounding factors. As I also noted above, the differences between the three language detection approaches are quite small ("not very dramatic"). Using Simplemma instead of Lingua (low accuracy mode) with Omikuji improved the F1 score by 0.7 points (pycld3 was halfway between those), and some of these differences could well be just random variation. I understand that Lingua's strong point is the high accuracy it achieves. But for an application like input preprocessing in Annif, it just doesn't make sense to spend so many computing resources (even on just the low accuracy mode tested here) on the language detection part, when the maximum possible benefit is something like half a percentage point in F1 score compared to other, more lightweight approaches. Those resources would likely be better spent on other parts of the process, for example the classification algorithms themselves rather than the preprocessing.
That is great news, congratulations! I might consider doing another round (also including py3langid, for example, or a possible new version of Simplemma with lower memory use) but for now I have other, more urgent tasks.
I realized that I can just run the Omikuji evaluation part again with Lingua 1.1.3, without redoing the whole benchmark. Hang on...
@pemistahl I upgraded to Lingua 1.1.3 and reran the Omikuji and MLLM evaluations. The Omikuji evaluation runtime decreased from 935 to 856 seconds and the MLLM runtime from 1210 to 1133 seconds. So it's an improvement for sure, but not super dramatic. I updated the table above. Evaluation scores didn't change at all. The benchmark with Lingua in high accuracy mode is currently running, but as expected, it's taking a while...
I finished the (partial) benchmark of Lingua in high accuracy mode and edited the results table above accordingly. The runtime was at least an order of magnitude larger than in low accuracy mode. Sorry @pemistahl, but the result quality hardly changed at all. I don't think this is because Lingua is less accurate than the others; for some reason it's just not very well suited to this particular task (and it's possible that tweaking the way it's used could improve the results).
Thanks for the tip @adbar, I wasn't aware of hyperfine. Though it seems to me it will only measure execution time, not memory usage.
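One standard-library way to capture peak memory from inside the benchmark process itself, as a complement to hyperfine's timing (a sketch; the resource module is Unix-only):

```python
# Peak resident set size of the current process, standard library only.
# Note: ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
import resource

def peak_rss_mb() -> float:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# ... run the language detection workload here ...
print(f"peak RSS: {peak_rss_mb():.0f} MB")
```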
No worries, @osma. I'm not resentful. :)
This is absolutely reasonable. Then Lingua is simply not the right tool for your job, and that's ok. Luckily, there are enough language detectors to choose from, especially in the Python ecosystem. I was curious and added Simplemma to my own evaluation of language detectors. As expected, the vocabulary-based approach is not as good as the n-gram-based approach. The detection accuracy differs significantly between languages. For Finnish, Simplemma is pretty accurate, with 81% on average. But it does not perform as well for other languages, such as Spanish. You can find the accuracy reports in the Lingua repo.
Yes, right. There's also the issue of API design: Simplemma provides the in_target_language function, which is well suited to this specific task of filtering by language. It gives the estimated proportion of words in the text that are in the expected target language, and it only needs to load and use a single language model. I couldn't find anything similar in Lingua, so what I did in PR #615 was to use Lingua to detect the language of a sentence out of all the languages it knows, which requires loading and using all 75 available language models (or at least a significant proportion of them). This means Lingua has to do a lot more work than Simplemma to accomplish the same thing, which at least partly explains the difference in performance.
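The filtering pattern described here looks roughly like the following (a sketch assuming the Simplemma API where in_target_language takes a lang argument directly; older versions required loading language data first, and the 0.5 threshold is an arbitrary example, not the value Annif uses):

```python
# Sketch of language filtering with Simplemma; threshold is an example value.
from simplemma import in_target_language

def keep_sentence(sentence: str, lang: str = "fi", threshold: float = 0.5) -> bool:
    # Proportion of words recognized as belonging to the target language.
    return in_target_language(sentence, lang=lang) >= threshold

sentences = ["Tämä lause on suomea.", "This sentence is English."]
filtered = [s for s in sentences if keep_sentence(s)]
```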
This is great, thanks a lot! It's very useful to have a benchmark that is evaluated on many different detectors. I didn't expect Simplemma to be super accurate, as language detection is just an extra feature and the main purpose of the library is lemmatization. There also seem to be large differences in the size of the vocabularies Simplemma knows about for the languages it supports. It's quite natural that Simplemma has difficulties detecting languages with small vocabulary sizes. Finnish happens to be the language with the largest included vocabulary, though this also has a lot to do with the complex morphology of the language. Would it be possible for you to also include the CPU time and memory spent for each detector in the benchmark results? At least for me those are important considerations, and @adulau asked about this above too, so others would likely be interested as well. Since you run the same tests on every detector, the resource usage should be quite easily comparable, right?
Thanks @pemistahl for the detailed evaluation! I also like the bar plots you made to compare the results by language. A quick remark on the methodology: you write that "a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted". The fact that an n-gram approach works well on single words and on word pairs explains the overall performance of Lingua and the others, but not the relatively poor performance of CLD; that's interesting. Simplemma works as expected IMO; it's a meaningful baseline, or a good trade-off between simplicity and accuracy, and as @osma says, language detection isn't its main purpose anyway.
I think it is not too difficult to implement something like this in Lingua. I will try to do that.
I have to rewrite some parts of the accuracy reports script to do so, but yes, it is surely possible. I don't know when I will have the time, though.
Maybe I will try to find a better source of test data for certain languages, but that is not on my todo list at the moment. Perhaps later on.
The pycld3 language detection library, which we depend on, seems to have install issues on Python 3.10 (see #589). The last release, 0.22, was in March 2021.
I think we should consider switching to a more actively maintained library. This should be easy now that we only use language detection for language filtering and not in other parts of the Annif codebase.
A promising candidate would be Lingua, but there are others. A rough sketch of the two APIs side by side follows below.
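For a sense of scale, here is the call being replaced next to a possible Lingua-based equivalent (a sketch of the two public APIs, not actual Annif code; the low accuracy mode shown is one option among several):

```python
# Rough shape of the two APIs; not actual Annif code.
import cld3
from lingua import LanguageDetectorBuilder

text = "Tämä on suomenkielinen lause."

# pycld3: a single function call returning a prediction tuple.
pred = cld3.get_language(text)
print(pred.language, pred.is_reliable)    # e.g. "fi", True

# Lingua: build a detector once up front, then query it repeatedly.
detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_low_accuracy_mode()
    .build()
)
print(detector.detect_language_of(text))  # e.g. Language.FINNISH
```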