Adding enhanced cross-lingual retrieval benchmark by merging retrieval pools from different languages #1625

yjoonjang · 2024-12-24T06:43:54Z

Hi MTEB maintainers @KennethEnevoldsen, @Muennighoff

@seongtaehong and I were considering a way to make cross-lingual retrieval tasks more challenging by merging retrieval pools from two different languages.

Here’s the idea:

The task would be to retrieve two gold passages from a retrieval pool composed of content in two different languages.
The retrieval pool would consist of pairs of passages that have the same meaning but are written in different languages (e.g., StrategyQA and Ko-StrategyQA, with the latter being the Korean translation of StrategyQA).
Given a query in Korean, the model would need to retrieve the top 2 passages, and those two passages should be the gold pair, one in each language. The same applies to queries in English.
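
To make the setup above concrete, here is a minimal sketch in plain Python. It is not tied to the MTEB task API. The helper names (`merge_pools`, `merge_qrels`, `gold_pair_at_2`), the `lang:doc_id` ID scheme, and the assumption that translated passages share a document ID with their originals are all ours, chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set


def merge_pools(pools: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge per-language corpora into one pool.

    `pools` maps a language code to {doc_id: text}. IDs are prefixed with
    the language code so parallel passages do not collide (assumed scheme).
    """
    merged: Dict[str, str] = {}
    for lang, corpus in pools.items():
        for doc_id, text in corpus.items():
            merged[f"{lang}:{doc_id}"] = text
    return merged


def merge_qrels(
    qrels: Dict[str, Iterable[str]], langs: List[str]
) -> Dict[str, Set[str]]:
    """Expand each gold doc ID into one gold ID per language."""
    return {
        qid: {f"{lang}:{doc_id}" for doc_id in doc_ids for lang in langs}
        for qid, doc_ids in qrels.items()
    }


def gold_pair_at_2(ranking: List[str], gold: Set[str]) -> bool:
    """True if the top 2 results are both gold and in different languages."""
    top2 = ranking[:2]
    if len(top2) < 2 or not all(doc in gold for doc in top2):
        return False
    lang_first = top2[0].split(":", 1)[0]
    lang_second = top2[1].split(":", 1)[0]
    return lang_first != lang_second


# Toy data: two parallel passages per language, one query with one gold ID.
pool = merge_pools(
    {
        "en": {"d1": "Paris is the capital of France.", "d2": "Cats are mammals."},
        "ko": {"d1": "파리는 프랑스의 수도이다.", "d2": "고양이는 포유류이다."},
    }
)
gold = merge_qrels({"q1": ["d1"]}, ["en", "ko"])
print(gold_pair_at_2(["ko:d1", "en:d1", "en:d2"], gold["q1"]))
```

A stricter or softer metric (for example, recall of the gold pair at a larger cutoff) could replace `gold_pair_at_2`. The sketch only shows one way the "top 2 in different languages" condition might be checked.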

We believe this approach reflects a more realistic scenario, as many retrieval pools in the real world are derived from web crawling, and such pools naturally include data in multiple languages.
What are your thoughts on this idea? Let me know if you'd like me to adjust anything further!
