Support `Language.OTHER` category #240

ivan-kleshnin · 2024-10-09T12:51:18Z

I'm parsing multilingual texts in which I only want to detect a couple of selected languages, with high precision. All other languages should fall under None or an "Other" category.

Unless I'm missing something, lingua-py does not support this (common) case properly.

Say I'm interested in Language.ENGLISH and Language.RUSSIAN, and the corpus contains a dozen languages. When lingua-py encounters a text in a language outside the configured set, it assigns seemingly arbitrary confidence values to the given options. Often they are around 0.75 to 0.25, so adding a minimum limit can help. But sometimes the confidence is very high, up to 1...

Backend Entwickler, Wanderer und Abenteurer

is detected as ENG with a confidence of 1, given just the above two alternatives, presumably because of Latin vs. Cyrillic heuristics. If I add Language.GERMAN, it's correctly detected as DEU with 0.99 confidence. The issue, as stated, is that I can't simply keep adding more and more languages, because that slows down extraction. The more languages lingua-py has to choose from, the slower it is; I can confirm what the docs say about this.

So the approach to consider is adding a Language.OTHER category that would likely absorb the unreasonably high confidence values and help resolve such cases. Please let me know if you think this can be solved differently, or if there's already a solution in the library. I haven't found one myself.
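
For reference, here is a minimal sketch of the minimum-limit workaround mentioned above. `pick_or_other` and `lingua_scores` are hypothetical helper names, and the 0.9 threshold is an arbitrary assumption. `lingua_scores` assumes a recent lingua-py release in which `compute_language_confidence_values` returns objects with `language` and `value` attributes; older releases may differ.

```python
from typing import Iterable, List, Optional, Tuple


def pick_or_other(
    scores: Iterable[Tuple[str, float]],
    min_confidence: float = 0.9,  # arbitrary cut-off, tune per corpus
) -> Optional[str]:
    """Return the top-scoring label, or None ("other") if it is below the cut-off."""
    best = max(scores, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < min_confidence:
        return None
    return best[0]


def lingua_scores(text: str) -> List[Tuple[str, float]]:
    """Hypothetical adapter: confidence values for ENGLISH vs. RUSSIAN only."""
    # Imported lazily so pick_or_other can be used without lingua installed.
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.RUSSIAN
    ).build()
    return [
        (cv.language.name, cv.value)
        for cv in detector.compute_language_confidence_values(text)
    ]


# Intended usage (requires lingua-py):
# pick_or_other(lingua_scores("Backend Entwickler, Wanderer und Abenteurer"))
```

This rejects the typical 0.75 / 0.25 split, but the German example above, which gets a confidence of 1 for English, would still pass the threshold. That is the gap a Language.OTHER category would be meant to close.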
