-
Hi! Thanks for the awesome library 👍 Could you please clarify two things for me?

I. Is it possible to make the confidence scores more meaningful? Right now my sample code, which builds a detector from English and German only, outputs 0.99 and 0.01. I would expect both scores to be relatively low in this case, which would be much more helpful. I am working on a task where I can suggest which languages are most likely to occur in a given text, but that hint is not a hard rule. To improve detection performance, I wanted to obtain confidence scores from a detector built from a limited set of languages and, if the top score is low, pass the text on to the larger and slower from_all_languages() detector (roughly the sketch below). With the current behaviour this seems impossible. Am I missing something?

II. How does the caching work? I noticed that subsequent detector calls with the same set of texts complete almost instantaneously, but I could not find the exact implementation.
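For reference, here is roughly the fallback strategy I have in mind. This is only a minimal sketch assuming the Python API (lingua-py), where compute_language_confidence_values() returns (language, value) pairs sorted by descending confidence; the language set and the 0.5 threshold are placeholders:

```python
from typing import Optional

from lingua import Language, LanguageDetectorBuilder

# Fast detector restricted to the languages I expect in my data.
fast_detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN
).build()

# Slower fallback detector that knows every supported language.
full_detector = LanguageDetectorBuilder.from_all_languages().build()


def detect_with_fallback(text: str, threshold: float = 0.5) -> Optional[Language]:
    # Highest-confidence guess from the restricted detector.
    language, confidence = fast_detector.compute_language_confidence_values(text)[0]
    if confidence >= threshold:
        return language
    # Low confidence: the text is probably in a language the fast detector
    # does not know about, so re-run detection against all languages.
    return full_detector.detect_language_of(text)
```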
-
Hi @nick-maykr, thank you for your questions and your interest in my library.

You are not using the latest Lingua release, 1.3, in which I have reworked the confidence score calculation. For your code above, the score for English is now 0.72 and for German 0.28. If you build the detector from these two languages only, it does not know anything about the existence of the other languages; that is why the score for English is not as low as you would expect. When the detector is built from all languages, the probability for English is reduced to 0.003 and for German to 0.001, while for Italian it would be 0.78. So the confidence score for a single language is always calculated relative to the score…

Does that answer your questions? :)
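To make this concrete, here is a quick sketch, assuming the Python API of lingua-py 1.3, where compute_language_confidence_values() returns (language, value) pairs; the sample sentence is only a placeholder for the Italian text from your example:

```python
from lingua import Language, LanguageDetectorBuilder

text = "..."  # placeholder for the Italian sentence from the question

# Detector that only knows English and German: the scores are distributed
# over these two languages alone, so one of them always looks fairly likely.
two_languages = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN
).build()
for language, value in two_languages.compute_language_confidence_values(text):
    print(f"{language.name}: {value:.2f}")  # e.g. ENGLISH: 0.72, GERMAN: 0.28

# Detector built from all languages: English and German now compete with
# Italian (and everything else), so their scores drop sharply.
all_languages = LanguageDetectorBuilder.from_all_languages().build()
for language, value in all_languages.compute_language_confidence_values(text)[:3]:
    print(f"{language.name}: {value:.3f}")  # e.g. ITALIAN: 0.780, ENGLISH: 0.003
```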