Compute percentage of languages present in a document (paragraph) #159
Unanswered
thangld201
asked this question in
Q&A
Replies: 0 comments
Hi, I'm trying to compute the percentage of each language appearing in a document. My current use case involves two known languages and a document in which the two are mixed (code-switching). I'm training an ML model to make its output monolingual (leaning more towards one particular language), so I need a reliable measure to estimate whether the model is making progress, i.e. whether the share of the target language is increasing. Currently, I use lingua with the compute_language_confidence_values() function, but the predictions are quite poor.
For example, given a piece of text in Japanese and English:
So the result is not quite correct (it should be 0.8-0.2 or something similar). Do you have any advice on how I can improve or modify the library for my use case?
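For comparison, here is a minimal sketch of the kind of measure I have in mind. It does not use lingua at all: it relies on the assumption that Japanese and English are written in disjoint scripts, and simply counts letters per script using Python's standard `unicodedata` module. The names `script_of` and `script_shares` are hypothetical helpers made up for this illustration, and the "ja"/"en" labels are only as good as that script assumption (kanji are shared with Chinese, and Latin letters with many other languages).

```python
import unicodedata
from collections import Counter
from typing import Dict, Optional


def script_of(ch: str) -> Optional[str]:
    """Map one character to a coarse bucket, or None if it is not a letter."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, whitespace, symbols
    name = unicodedata.name(ch, "")
    # Assumption: kana and CJK ideographs count as Japanese in this setting.
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")):
        return "ja"
    # Assumption: any Latin letter counts as English in this setting.
    if name.startswith("LATIN"):
        return "en"
    return "other"


def script_shares(text: str) -> Dict[str, float]:
    """Return the fraction of letters that fall into each bucket."""
    # NFKC folds full-width Latin and half-width katakana into their usual forms.
    normalized = unicodedata.normalize("NFKC", text)
    counts = Counter(
        bucket for bucket in map(script_of, normalized) if bucket is not None
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: n / total for bucket, n in counts.items()}


if __name__ == "__main__":
    print(script_shares("今日は天気がいいです hello"))
```

Note that this counts characters, not words, so one kanji weighs as much as one Latin letter; the shares are useful for tracking a trend across model checkpoints rather than as absolute percentages.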