Use Lingua instead of pycld3 for language detection #615
Conversation
Codecov Report: Base: 99.61% // Head: 99.59% // decreases project coverage by -0.03%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #615 +/- ##
==========================================
- Coverage 99.61% 99.59% -0.03%
==========================================
Files 87 87
Lines 6038 5946 -92
==========================================
- Hits 6015 5922 -93
- Misses 23 24 +1
@@ -32,7 +32,6 @@ def test_lang_filter(project):
Kansalliskirjasto on kaikille avoin kulttuuriperintöorganisaatio, joka
palvelee valtakunnallisesti kansalaisia, tiedeyhteisöjä ja muita
yhteiskunnan toimijoita.
Abc defghij klmnopqr stuwxyz abc defghij klmnopqr stuwxyz.
I had to change this test. This is a nonsensical sentence that pycld3 is unsure about (`.is_reliable == False`), so the language filter gives it the benefit of the doubt and retains it. Lingua simply identifies it as Swahili, so it gets stripped.
In the Lingua documentation I can't see a direct way of telling whether Lingua is unsure about some input. There is the minimum relative distance parameter, which could be tuned, but I'm not sure it would help with nonsensical input. For example, converting PDF documents to text is quite likely to produce garbage "sentences" that aren't in any real language.
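The filter behavior discussed above can be sketched in plain Python. This is a minimal illustration, not the real pycld3 or Lingua API: `DetectionResult` and `keep_sentence` are invented stand-ins that mimic the pycld3-based logic of giving unreliable detections the benefit of the doubt.

```python
from dataclasses import dataclass

@dataclass
class DetectionResult:
    language: str      # detected language code, e.g. "fi" (illustrative)
    is_reliable: bool  # pycld3-style reliability flag (illustrative)

def keep_sentence(result: DetectionResult, expected_lang: str) -> bool:
    """Mirror the pycld3-based filter: unreliable detections get the
    benefit of the doubt and the sentence is retained."""
    if not result.is_reliable:
        return True
    return result.language == expected_lang

# Nonsensical input flagged as unreliable (pycld3-style): kept.
assert keep_sentence(DetectionResult("sw", is_reliable=False), "fi")
# A confident wrong-language detection (Lingua-style, no reliability
# flag to fall back on): stripped.
assert not keep_sentence(DetectionResult("sw", is_reliable=True), "fi")
```

A Lingua-based filter would need some other signal, such as a threshold on its confidence values, to reproduce the "unsure" branch.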
Thank you @osma for adding my library to your evaluation. :) It is not surprising that CLD3 is faster than Lingua: CLD3 is implemented in C++, whereas my library is pure Python (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language, so there will always be compromises to make. As soon as PyO3 supports exporting Rust enums as Python enums, I will create Python bindings for my Rust implementation of Lingua, which will be significantly faster than the pure Python version.

It seems that you mainly want to classify large documents consisting of multiple sentences. For that kind of textual input, the high accuracy mode does not bring much benefit; it is better suited for short texts such as tweets. So the advantages of Lingua over other language detectors do not pay off for you, and that's OK. In that case I think it is better if you stick with CLD3 and benefit from its higher detection speed.
Understood. But I think the current Lingua implementation (with NumPy vectors) is slower than it needs to be because of the O(log n) lookups: it has to do binary searches in big sorted arrays. This is not a question of implementation language but of algorithmic efficiency; even pure Python (here helped along by NumPy) can be quite fast. I wrote some ideas about further optimizing Lingua in this discussion.

The problem is that sticking with CLD3 is not a good option either, as explained in the OP of #593: its most active Python binding library (pycld3) no longer appears to be actively maintained, and the other bindings (cld3, gcld3) are even older. pycld3 doesn't work with Python 3.10, so unless someone starts maintaining it again, we will need to switch to something else.
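The algorithmic point above can be sketched in pure Python (the n-grams and frequency values below are invented for illustration): a binary search over parallel sorted arrays costs O(log n) per lookup, while a hash table answers the same query in O(1) on average.

```python
import bisect
from typing import Optional

# Parallel sorted arrays, in the spirit of Lingua's NumPy-backed
# models (keys must stay sorted; all values here are made up).
keys = ["aa", "ab", "ba", "bb", "ca"]   # sorted n-grams
vals = [0.10, 0.30, 0.20, 0.25, 0.15]   # corresponding frequencies

def lookup_sorted(ngram: str) -> Optional[float]:
    """O(log n): binary search in the sorted key array."""
    i = bisect.bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return vals[i]
    return None

table = dict(zip(keys, vals))

def lookup_hash(ngram: str) -> Optional[float]:
    """O(1) average: plain dict lookup."""
    return table.get(ngram)

assert lookup_sorted("ba") == lookup_hash("ba") == 0.20
assert lookup_sorted("zz") is None
```

The dict costs more memory per entry than a packed sorted array, which is presumably part of the trade-off a model-storage design has to weigh.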
Hi @osma, I have just released Lingua 1.1.2, which removes the most significant performance problems of the previous version. The language models are now stored on disk as serialized NumPy arrays instead of JSON, which reduces the preloading time of the language models significantly (to between 1 and 2 seconds for all models on my machine). I have also removed a bottleneck in the language detection code, which makes detection approximately 40% faster. Can you please run your evaluation again with the new version? Would you now consider switching to my library? Thanks. :)
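The storage change described above can be illustrated with a small sketch (the array contents are invented, and an in-memory buffer stands in for the model files): a binary `np.save`/`np.load` round trip reads bytes straight into an array, while the JSON path has to parse text and then rebuild the array, which is why binary preloading is faster.

```python
import io
import json
import numpy as np

# Invented stand-in for a language model's frequency array.
frequencies = np.array([0.12, 0.05, 0.31], dtype=np.float32)

# JSON round trip: serialize to text, parse it back, rebuild the array.
as_json = json.dumps(frequencies.tolist())
from_json = np.array(json.loads(as_json), dtype=np.float32)

# Binary round trip: np.save writes the raw array; np.load reads it back
# without any text parsing.
buf = io.BytesIO()
np.save(buf, frequencies)
buf.seek(0)
from_npy = np.load(buf)

assert np.array_equal(from_json, frequencies)
assert np.array_equal(from_npy, frequencies)
```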
Thanks @pemistahl for the update, that is great news! I will try to do a new round of experiments soon, comparing language filtering with pycld3, Lingua, or the recently added language detection functionality in Simplemma. This time I will use a dataset that actually should benefit from the filtering; the tutorial data set I used above was a bit disappointing in this respect.
Force-pushed from b922152 to 56688e2.
Rebased this PR branch on current master.
SonarCloud Quality Gate passed: 0 bugs, no coverage information.
This didn't work well according to the benchmarks, and the PR branch now also conflicts with the master branch due to the Black reformatting. It doesn't make sense to spend time salvaging this, so I'll just close the PR.
This draft PR fixes #593 by switching from the pycld3 language detection library to Lingua (by @pemistahl).
Lingua is used in the low accuracy mode, because it is much faster than the high accuracy mode and needs a lot less memory. I tested the high accuracy mode very briefly, but the startup overhead alone was so high (tens of seconds) that I considered it a non-starter.
I did a little benchmarking using the Annif tutorial yso-nlf data set and two project configurations from the tutorial, with two backend algorithms: MLLM and Omikuji Parabel. I compared the current master (which uses pycld3) to this PR branch, which uses Lingua 1.1.1. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:
For the unfiltered baseline I used `transform=limit(5000)` instead.

Here are some performance stats (total user time over all CPU cores and maximum resident set size) that I measured using `/usr/bin/time -v`:
Here are the evaluation results (running `annif eval` on the 300 documents in the test set and measuring F1@5 and nDCG scores; higher is better):

The good news:

The bad news:
`result.iso_code_639_1.name.lower()` does the trick. pycld3 returns language codes directly, which makes the API easier to use.

I think the take-home message is that if Lingua could be made faster still for the detection process, then we could consider switching to it. Right now the performance cost seems quite high. It would also be nice to identify a data set where language filtering actually improves results; we could then measure whether Lingua does this better than pycld3 or not. This data set was not a good choice in that respect.
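The API difference noted above can be smoothed over with a small adapter. The Lingua-like classes below are minimal stubs written for this sketch (only the `iso_code_639_1.name` access pattern is taken from the discussion); they are not the real library.

```python
from enum import Enum, auto

class IsoCode639_1(Enum):
    """Stub mirroring the shape of Lingua's ISO 639-1 code enum."""
    FI = auto()
    SW = auto()

class LinguaResult:
    """Stub standing in for a Lingua detection result."""
    def __init__(self, iso_code: IsoCode639_1):
        self.iso_code_639_1 = iso_code

def to_language_code(result) -> str:
    """Normalize either style of result to a lowercase ISO 639-1 code."""
    if isinstance(result, str):
        # pycld3-style: the result already is a language code.
        return result.lower()
    # Lingua-style: extract the code from the result object.
    return result.iso_code_639_1.name.lower()

assert to_language_code("FI") == "fi"
assert to_language_code(LinguaResult(IsoCode639_1.SW)) == "sw"
```

An adapter like this would keep the language-filter code independent of which detection backend is ultimately chosen.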