Does BERTopic(language="multilingual") perform well on all 50+ languages? Or is it still better with English? #2157
When using BERTopic with language="multilingual", is the model still more effective at assigning topics to English texts than to texts in other languages in a multilingual dataset? For example, in a dataset containing English, French, and German documents, does BERTopic tend to generate more coherent or accurate topics for the English texts (or does it handle English better in any other way)? If so, what factors contribute to this discrepancy? I am trying to understand whether there could still be a bias toward English even when using language="multilingual".
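To make this concrete, here is a rough sketch of the kind of per-language check I have in mind; `docs` and `langs` are placeholders for my own data, and the outlier-rate comparison is only a crude proxy for topic quality.

```python
# Rough sketch only: `docs` and `langs` stand in for my own data and are not
# given real values here.
from bertopic import BERTopic
import pandas as pd

docs = [...]   # mixed English / French / German documents (placeholder)
langs = [...]  # language label per document, e.g. "en", "fr", "de" (placeholder)

topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)

# One crude proxy for per-language quality: the share of each language's
# documents that end up in the outlier topic (-1).
assignments = pd.DataFrame({"lang": langs, "topic": topics})
print(assignments.groupby("lang")["topic"].apply(lambda t: (t == -1).mean()))
```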
Replies: 1 comment

It is important to first explore what it means to select language="multilingual": it mainly determines which multilingual embedding model is used under the hood. In practice, the distribution of languages in the training data of such models is typically quite uneven, and most multilingual embedding models are trained mostly on English data, so they tend to perform a bit better there.
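For reference, a minimal sketch of what this looks like in code; the model name paraphrase-multilingual-mpnet-base-v2 below is just one example of a multilingual sentence-transformers model, not something recommended in this thread.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# language="multilingual" mainly changes which embedding model BERTopic uses
# by default; the rest of the pipeline stays the same.
topic_model = BERTopic(language="multilingual")

# The same idea, stated explicitly: pass a multilingual embedding model
# yourself. This also makes it easy to try a larger multilingual model if the
# default seems to favour English on your data.
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedding_model)

# topics, probs = topic_model.fit_transform(docs)  # docs: your own documents
```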