Support Language.OTHER category
#240
ivan-kleshnin
started this conversation in
Ideas
Replies: 0 comments
I'm parsing multilingual texts and want to detect only a couple of selected languages, with high precision. Texts in any other language should fall into a `None` or "Other" category. Unless I'm missing something, lingua-py does not properly support this (common) case.
Say I'm interested in `Language.ENGLISH` and `Language.RUSSIAN`, and the corpus contains a dozen languages. When lingua-py encounters a text in a language outside the configured set, it assigns what look like random confidence values to the given options. Often the split is around `0.75` to `0.25`, so adding a minimum confidence limit can help. But sometimes the confidence is very high, up to `1`.
...is detected as ENG with a confidence of `1`, given just the above two alternatives. Supposedly this is due to Latin vs. Cyrillic heuristics. If I add `Language.GERMAN`, it's correctly detected as DEU with `0.99` confidence. The issue, as stated, is that I can't simply keep adding more and more languages, because that slows down extraction. The more languages lingua-py has to select from, the slower it is; I can confirm what the docs claim about this.

So the approach to consider is adding a `Language.OTHER` category that would likely absorb the unreasonably high confidence numbers and help resolve such cases. Please let me know if you think the above can be solved differently, or if there's already a solution in the library. I haven't found one myself.
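
A minimal sketch of the requested behaviour, to make the two cases above concrete. It is not lingua-py API: the `OTHER` label, the `classify` helper, and the string language names are all hypothetical, and the confidence values are illustrative numbers assumed to have been collected from the detector into a plain dict beforehand.

```python
from typing import Mapping, Set

OTHER = "OTHER"  # hypothetical catch-all label, not part of lingua-py


def classify(
    confidences: Mapping[str, float],
    wanted: Set[str],
    min_confidence: float = 0.9,
) -> str:
    """Return the top-scoring wanted language, or OTHER.

    A text is labelled OTHER when the best candidate is not one of the
    wanted languages, or when its confidence is below min_confidence.
    """
    if not confidences:
        return OTHER
    best, score = max(confidences.items(), key=lambda item: item[1])
    if best in wanted and score >= min_confidence:
        return best
    return OTHER


wanted = {"ENGLISH", "RUSSIAN"}

# Ambiguous split between the two options: the minimum limit catches it.
print(classify({"ENGLISH": 0.75, "RUSSIAN": 0.25}, wanted))  # OTHER

# An extra candidate language acts as a decoy and is mapped to OTHER.
print(classify({"GERMAN": 0.99, "ENGLISH": 0.01, "RUSSIAN": 0.0}, wanted))  # OTHER

# Only two options and a confidence of 1: the limit alone cannot help.
print(classify({"ENGLISH": 1.0, "RUSSIAN": 0.0}, wanted))  # ENGLISH
```

The last call shows why a threshold on its own is not enough: a false positive that scores `1` passes any limit. Such texts are only caught when some other candidate competes for them, which is the gap a built-in `Language.OTHER` would be meant to fill.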