-
Hi! Thanks for the awesome library 👍 Could you please clarify two things for me?

I. Is it possible to make the confidence scores more meaningful? Right now my sample code, which builds a detector from English and German only, outputs 0.99 and 0.01. I would expect both scores to be relatively low in this case, which would be much more helpful. I am working on a task where I can suggest which languages are most likely to occur in a given text, but that hint is not a hard rule. To improve detection performance, I wanted to obtain confidence scores from a detector built from a limited set of languages and, if the top score is low, pass the text on to the larger and slower from_all_languages() detector (roughly the sketch below). With the current behaviour this seems impossible. Am I missing something?

II. How does the caching work? I noticed that subsequent detector calls with the same set of texts complete almost instantaneously, but I could not find the exact implementation.
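For reference, here is roughly the fallback strategy I have in mind. This is only a minimal sketch assuming the Python API (lingua-py), where compute_language_confidence_values() returns (language, value) pairs sorted by descending confidence; the language set and the 0.5 threshold are placeholders:

```python
from typing import Optional

from lingua import Language, LanguageDetectorBuilder

# Fast detector restricted to the languages I expect in my data.
fast_detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN
).build()

# Slower fallback detector that knows every supported language.
full_detector = LanguageDetectorBuilder.from_all_languages().build()


def detect_with_fallback(text: str, threshold: float = 0.5) -> Optional[Language]:
    # Highest-confidence guess from the restricted detector.
    language, confidence = fast_detector.compute_language_confidence_values(text)[0]
    if confidence >= threshold:
        return language
    # Low confidence: the text is probably in a language the fast detector
    # does not know about, so re-run detection against all languages.
    return full_detector.detect_language_of(text)
```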
-
Hi @nick-maykr, thank you for your questions and your interest in my library.

You are not using the latest Lingua release, 1.3, in which I have reworked the confidence score calculation. For your code above, the score for English is now 0.72 and for German 0.28. If you build the detector from these two languages only, it does not know anything about the existence of the other languages; that is why the score for English is not as low as you would expect. When the detector is built from all languages, the probability for English is reduced to 0.003 and for German to 0.001, while for Italian it would be 0.78. So the confidence score for a single language is always calculated relative to the score…

Does that answer your questions? :)
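To make this concrete, here is a quick sketch, assuming the Python API of lingua-py 1.3, where compute_language_confidence_values() returns (language, value) pairs; the sample sentence is only a placeholder for the Italian text from your example:

```python
from lingua import Language, LanguageDetectorBuilder

text = "..."  # placeholder for the Italian sentence from the question

# Detector that only knows English and German: the scores are distributed
# over these two languages alone, so one of them always looks fairly likely.
two_languages = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN
).build()
for language, value in two_languages.compute_language_confidence_values(text):
    print(f"{language.name}: {value:.2f}")  # e.g. ENGLISH: 0.72, GERMAN: 0.28

# Detector built from all languages: English and German now compete with
# Italian (and everything else), so their scores drop sharply.
all_languages = LanguageDetectorBuilder.from_all_languages().build()
for language, value in all_languages.compute_language_confidence_values(text)[:3]:
    print(f"{language.name}: {value:.3f}")  # e.g. ITALIAN: 0.780, ENGLISH: 0.003
```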