
Welcome to SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Mu-SHROOM

Welcome to the official shared task website for Mu-SHROOM, a SemEval-2025 shared task!

Mu-SHROOM stands for "Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes". Mu-SHROOM invites participants to detect hallucination spans in the outputs of instruction-tuned LLMs in a multilingual context. This shared task builds upon our previous iteration, SHROOM, with a few key changes:

  • We're looking at multiple languages: Arabic (Modern Standard), Chinese (Mandarin), English, Finnish, French, German, Hindi, Italian, Spanish, and Swedish;
  • We're now focusing on LLM outputs;
  • Participants will have to predict where hallucinations occur.

This website is under construction. More information will be available soon.

What is Mu-SHROOM?

The task consists of detecting spans of text corresponding to hallucinations. Participants are asked to determine which parts of a given text produced by LLMs constitute hallucinations. The task is held in a multilingual and multi-model context, i.e., we provide data in multiple languages, produced by a variety of public-weights LLMs.

In practice, we provide an LLM output (as a string of characters, a list of tokens, and a list of logits), and participants have to compute, for every character in the LLM output string, the probability that it is marked as a hallucination. Participants are free to use any approach they deem appropriate, including using external resources.
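To make this input/output contract concrete, here is a minimal Python sketch of a trivial system. It is purely illustrative: the file name and the field name `model_output_text` are assumptions about the released JSONL format, and the constant 0.5 prediction is just a placeholder; the released data files and the official scoring program are the authoritative reference.

```python
import json

def predict_char_probs(model_output_text: str) -> list[float]:
    """Toy baseline: every character gets hallucination probability 0.5.

    A real system would exploit the provided tokens and logits, or
    external resources, to score each character individually.
    """
    return [0.5] * len(model_output_text)

# Hypothetical file and field names -- check the released data for the real ones.
with open("mushroom.val.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record["model_output_text"]  # the LLM output string (assumed key)
        probs = predict_char_probs(text)
        assert len(probs) == len(text)      # one probability per character
```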

How will participants be evaluated?

Participants will be ranked along two character-level metrics (illustrated in the sketch below):

  1. the intersection-over-union of characters marked as hallucinations in the gold reference vs. characters predicted as such;
  2. how well the probability that a participant's system assigns to each character being part of a hallucination correlates with the empirical probabilities observed among our annotators.

Rankings and submissions will be done separately per language.
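The following Python sketch illustrates both metrics under stated assumptions: binary spans are obtained by thresholding the per-character probabilities at 0.5, and the correlation shown is Spearman's rho between predicted and annotator-derived probabilities. The official scoring program remains the reference implementation.

```python
from scipy.stats import spearmanr

def char_iou(pred_chars: set[int], gold_chars: set[int]) -> float:
    """Intersection-over-union of character indices marked as hallucination."""
    if not pred_chars and not gold_chars:
        return 1.0  # nothing flagged and nothing to flag
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)

def prob_correlation(pred_probs: list[float], gold_probs: list[float]) -> float:
    """Spearman correlation between predicted per-character probabilities
    and the empirical probabilities derived from annotator judgments."""
    rho, _ = spearmanr(pred_probs, gold_probs)
    return rho

# Toy example over a 10-character output.
pred_probs = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.0]
gold_probs = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0, 0.0]  # e.g. annotator vote shares
pred = {i for i, p in enumerate(pred_probs) if p >= 0.5}  # assumed 0.5 threshold
gold = {i for i, p in enumerate(gold_probs) if p >= 0.5}  # {3, 4, 5, 6}
print(char_iou(pred, gold))                    # 3/4 = 0.75
print(prob_correlation(pred_probs, gold_probs))
```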

Participants can also download the scoring program on its own here, for reference and for developing their systems.

Participant info

Register ahead of time on our submission website.

Want to be kept in the loop? Join our Google group mailing list or the shared task Slack! We also have a Twitter account.

Data

Below are links to access the data already released, as well as provisional expected release dates for future splits. Do note that release dates are subject to change.

| Dataset split       | Access                          |
| ------------------- | ------------------------------- |
| Sample set          | download (v1)                   |
| Validation set      | download (v2)                   |
| Unlabeled train set | download (v1)                   |
| Unlabeled test set  | To be published (ETA Jan 10th)  |
| Labeled test set    | To be published (ETA Feb 1st)   |

We are releasing a participant kit, which we will keep expanding. For now, it contains the scoring program as well as a random baseline; you can download it here.

Important dates

This information is subject to change.

  • Sample data available: 15 July 2024
  • Validation data ready: 2 September 2024
  • Evaluation start: 10 January 2025
  • Evaluation end: 31 January 2025
  • Paper submission due: 28 February 2025 (TBC)
  • Notification to authors: 31 March 2025 (TBC)
  • Camera ready due: 21 April 2025 (TBC)
  • SemEval workshop: Summer 2025 (co-located with a major NLP conference)

Organizers of the shared task

Looking for something else?

The website for the previous iteration of the shared task is available here.

The logo is available here (download); we encourage participants to use it where relevant (especially on your posters)!