Use Lingua instead of pycld3 for language detection #615
Conversation
Codecov Report: Base: 99.61% // Head: 99.59% // decreases project coverage by -0.03%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #615 +/- ##
==========================================
- Coverage 99.61% 99.59% -0.03%
==========================================
Files 87 87
Lines 6038 5946 -92
==========================================
- Hits 6015 5922 -93
- Misses 23 24 +1
@@ -32,7 +32,6 @@ def test_lang_filter(project):
Kansalliskirjasto on kaikille avoin kulttuuriperintöorganisaatio, joka
palvelee valtakunnallisesti kansalaisia, tiedeyhteisöjä ja muita
yhteiskunnan toimijoita.
Abc defghij klmnopqr stuwxyz abc defghij klmnopqr stuwxyz.
I had to change this test. This is a nonsensical sentence that pycld3 is unsure about (`.is_reliable == False`), so the language filter gives it the benefit of the doubt and retains it. Lingua simply identifies it as Swahili, so it gets stripped.
In the Lingua documentation I can't see a direct way of telling whether Lingua is unsure about some input. There is the minimum relative distance parameter, which could be tuned, but I'm not sure it would help with nonsensical input. For example, converting PDF documents to text is quite likely to produce garbage "sentences" that aren't in any real language.
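The filter behavior discussed above can be sketched in plain Python. This is a minimal illustration, not the real pycld3 or Lingua API: `DetectionResult` and `keep_sentence` are invented stand-ins that mimic the pycld3-based logic of giving unreliable detections the benefit of the doubt.

```python
from dataclasses import dataclass

@dataclass
class DetectionResult:
    language: str      # detected language code, e.g. "fi" (illustrative)
    is_reliable: bool  # pycld3-style reliability flag (illustrative)

def keep_sentence(result: DetectionResult, expected_lang: str) -> bool:
    """Mirror the pycld3-based filter: unreliable detections get the
    benefit of the doubt and the sentence is retained."""
    if not result.is_reliable:
        return True
    return result.language == expected_lang

# Nonsensical input flagged as unreliable (pycld3-style): kept.
assert keep_sentence(DetectionResult("sw", is_reliable=False), "fi")
# A confident wrong-language detection (Lingua-style, no reliability
# flag to fall back on): stripped.
assert not keep_sentence(DetectionResult("sw", is_reliable=True), "fi")
```

A Lingua-based filter would need some other signal, such as a threshold on its confidence values, to reproduce the "unsure" branch.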
Thank you @osma for adding my library to your evaluation. :) It is not surprising that CLD3 is faster than Lingua: CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises to make. As soon as PyO3 supports exporting Rust enums as Python enums, I will create Python bindings for my Rust implementation of Lingua, which will be significantly faster than the pure Python version.

It seems that you mainly want to classify large documents consisting of multiple sentences. For that kind of textual input, the high accuracy mode does not bring much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua over other language detectors do not pay off for you, and that's OK. In that case I think it is better if you stick with CLD3 and benefit from its higher detection speed.
Understood. But I think the current Lingua implementation (with NumPy vectors) is slower than it needs to be because of the O(log n) lookups: it has to do binary searches in big sorted arrays. This is not a question of implementation language but of algorithmic efficiency; even pure Python (here helped along by NumPy) can be quite fast. I wrote some ideas about further optimizing Lingua in this discussion.

The problem is that sticking with CLD3 is not a good option either, as explained in the OP of #593: its most active Python binding library (pycld3) no longer appears to be actively maintained, and the other bindings (cld3, gcld3) are even older. pycld3 doesn't work with Python 3.10, so unless someone starts maintaining it again, we will need to switch to something else.
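The algorithmic point above can be sketched in pure Python (the n-grams and frequency values below are invented for illustration): a binary search over parallel sorted arrays costs O(log n) per lookup, while a hash table answers the same query in O(1) on average.

```python
import bisect
from typing import Optional

# Parallel sorted arrays, in the spirit of Lingua's NumPy-backed
# models (keys must stay sorted; all values here are made up).
keys = ["aa", "ab", "ba", "bb", "ca"]   # sorted n-grams
vals = [0.10, 0.30, 0.20, 0.25, 0.15]   # corresponding frequencies

def lookup_sorted(ngram: str) -> Optional[float]:
    """O(log n): binary search in the sorted key array."""
    i = bisect.bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return vals[i]
    return None

table = dict(zip(keys, vals))

def lookup_hash(ngram: str) -> Optional[float]:
    """O(1) average: plain dict lookup."""
    return table.get(ngram)

assert lookup_sorted("ba") == lookup_hash("ba") == 0.20
assert lookup_sorted("zz") is None
```

The dict costs more memory per entry than a packed sorted array, which is presumably part of the trade-off a model-storage design has to weigh.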
Hi @osma, I have just released Lingua 1.1.2, which removes the most significant performance problems of the previous version. The language models are now stored on disk as serialized NumPy arrays instead of JSON, which reduces the preloading time of the language models significantly (to between 1 and 2 seconds for all models on my machine). I have also removed a bottleneck in the language detection code, which makes detection approximately 40% faster. Can you please run your evaluation again with the new version? Would you now consider switching to my library? Thanks. :)
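The storage change described above can be illustrated with a small sketch (the array contents are invented, and an in-memory buffer stands in for the model files): a binary `np.save`/`np.load` round trip reads bytes straight into an array, while the JSON path has to parse text and then rebuild the array, which is why binary preloading is faster.

```python
import io
import json
import numpy as np

# Invented stand-in for a language model's frequency array.
frequencies = np.array([0.12, 0.05, 0.31], dtype=np.float32)

# JSON round trip: serialize to text, parse it back, rebuild the array.
as_json = json.dumps(frequencies.tolist())
from_json = np.array(json.loads(as_json), dtype=np.float32)

# Binary round trip: np.save writes the raw array; np.load reads it back
# without any text parsing.
buf = io.BytesIO()
np.save(buf, frequencies)
buf.seek(0)
from_npy = np.load(buf)

assert np.array_equal(from_json, frequencies)
assert np.array_equal(from_npy, frequencies)
```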
Thanks @pemistahl for the update, that is great news! I will try to do a new round of experiments soon, comparing language filtering with pycld3, Lingua, or the recently added language detection functionality in Simplemma. This time I will use a dataset that actually should benefit from the filtering; the tutorial data set I used above was a bit disappointing in this respect.
Force-pushed from b922152 to 56688e2.
Rebased this PR branch on current master.
SonarCloud Quality Gate passed: 0 bugs, no coverage information.
This didn't work well according to the benchmarks, and the PR branch now also conflicts with the master branch due to the Black reformatting. It doesn't make sense to spend time salvaging this, so I'll just close the PR.
This draft PR fixes #593 by switching from the pycld3 language detection library to Lingua (by @pemistahl).
Lingua is used in the low accuracy mode, because it is much faster than the high accuracy mode and needs a lot less memory. I tested the high accuracy mode very briefly, but the startup overhead alone was so high (tens of seconds) that I considered it a non-starter.
I did a little benchmarking using the Annif tutorial yso-nlf data set and two project configurations from the tutorial, with two backend algorithms: MLLM and Omikuji Parabel. I compared the current master (which uses pycld3) to this PR branch, which uses Lingua 1.1.1. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:
For the unfiltered baseline I used `transform=limit(5000)` instead.

Here are some performance stats (total user time over all CPU cores and maximum resident set size) that I measured using `/usr/bin/time -v`:
Here are the evaluation results (running `annif eval` on the 300 documents in the test set and measuring F1@5 and nDCG scores; higher is better):

The good news:

The bad news:
`result.iso_code_639_1.name.lower()` does the trick. pycld3 returns language codes directly, which makes the API easier to use.

I think the take-home message is that if Lingua could be made faster still for the detection process, then we could consider switching to it. Right now the performance cost seems quite high. It would also be nice to identify a data set where language filtering actually improves results; we could then measure whether Lingua does this better than pycld3 or not. This data set was not a good choice in that respect.
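The API difference noted above can be smoothed over with a small adapter. The Lingua-like classes below are minimal stubs written for this sketch (only the `iso_code_639_1.name` access pattern is taken from the discussion); they are not the real library.

```python
from enum import Enum, auto

class IsoCode639_1(Enum):
    """Stub mirroring the shape of Lingua's ISO 639-1 code enum."""
    FI = auto()
    SW = auto()

class LinguaResult:
    """Stub standing in for a Lingua detection result."""
    def __init__(self, iso_code: IsoCode639_1):
        self.iso_code_639_1 = iso_code

def to_language_code(result) -> str:
    """Normalize either style of result to a lowercase ISO 639-1 code."""
    if isinstance(result, str):
        # pycld3-style: the result already is a language code.
        return result.lower()
    # Lingua-style: extract the code from the result object.
    return result.iso_code_639_1.name.lower()

assert to_language_code("FI") == "fi"
assert to_language_code(LinguaResult(IsoCode639_1.SW)) == "sw"
```

An adapter like this would keep the language-filter code independent of which detection backend is ultimately chosen.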