Compute percentage of languages present in a document (paragraph) #159

thangld201 · 2023-09-03T17:41:35Z

Hi, I'm trying to compute the percentage of each language appearing in a document. My current use case involves two known languages and a document in which the two are mixed (code-switching). I'm training an ML model to make its output monolingual (leaning towards a particular language), so I need a reliable measure to estimate whether the model is making progress (i.e., whether the language percentages shift in the right direction). Currently, I use lingua with the compute_language_confidence_values() function, but the predictions are quite poor.

For example, given a piece of text in Japanese and English:

from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.JAPANESE]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
detector.compute_language_confidence_values("わかりません hey do you understand me hey oh really")
>>> [ConfidenceValue(language=Language.ENGLISH, value=1.0),
 ConfidenceValue(language=Language.JAPANESE, value=0.0)]

So it's not quite correct (it should be around 0.8/0.2 or something similar). Do you have any advice on how I can improve or modify the library for my use case?
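
For reference, here is a minimal sketch of the kind of number I'm after. It does not use lingua at all: it assumes the two languages are written in different scripts (true for Japanese and English, not in general) and simply counts letters per script using the standard library. The helper names script_of and script_shares are hypothetical, made up for this illustration.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a coarse script label for one character, or None for non-letters.

    Assumption: Hiragana, Katakana and CJK ideographs count as Japanese,
    Latin letters count as English. This is a heuristic, not language detection.
    """
    if not ch.isalpha():
        return None  # skip spaces, digits, punctuation
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK")):
        return "japanese"
    if name.startswith("LATIN"):
        return "english"
    return "other"


def script_shares(text):
    """Return the fraction of letters that belong to each script label."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    text = "わかりません hey do you understand me hey oh really"
    for label, share in script_shares(text).items():
        print(f"{label}: {share:.2f}")
```

On the example sentence this counts 6 Japanese letters and 31 Latin letters, so roughly 0.16 vs. 0.84 by character. It obviously breaks down for language pairs that share a script, which is why I would prefer a proper solution inside the library.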

Replies: 0 comments
