While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages (EN, ID, FR, TR, ZH, RU, DE, ES), and manually annotate generated summaries for focus (precision) and coverage (recall).
Fajri Koto, Jey Han Lau, and Timothy Baldwin. Evaluating the Efficacy of Summarization Evaluation across Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
We examine content-based summarization evaluation from the perspectives of precision and recall, in the form of focus and coverage, comparing system-generated summaries against ground-truth summaries.
- We use a customized direct assessment method for annotation.
- We use Amazon Mechanical Turk for annotation. You can find the MTurk user interface in `mturk/html`.
- Jupyter notebooks 0-3 are used to pre- and post-process the MTurk annotations.
- In this repository, we only provide the annotation process for ID, FR, TR, ZH, RU, DE, ES. The annotation process for EN will be released separately because the data is from the FFCI paper.
- You can find all annotation results in the folder `resulting_data`.
- The provided scores are normalized z-scores (see the normalization sketch below the list).
- You can use Jupyter notebooks 5-6 to compute the traditional metrics and their Pearson and Spearman correlations (see the correlation example below the list).
- Please note that for ZH and RU you need to convert all characters/words to their Latin form using the Jupyter notebook `5. transform_RU_and_ZH.ipynb` (a transliteration sketch is given below the list).
- We already provide all of the BERTScore and MoverScore outputs in the folders `bert_score` and `mover_score`, respectively (a BERTScore usage sketch is given below the list).
- You can use Jupyter notebooks 7-8 to compute their Pearson and Spearman correlations.
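
As a reference for how such z-score normalization is typically done in direct-assessment-style evaluation, here is a minimal sketch that standardizes each worker's raw ratings and then averages them per summary. The column names (`worker_id`, `summary_id`, `score`) are hypothetical and do not necessarily match the schema of the files in `resulting_data`.

```python
import pandas as pd

# Hypothetical raw MTurk ratings: one row per (worker, summary) judgement.
# Column names are illustrative, not the exact schema of `resulting_data`.
ratings = pd.DataFrame({
    "worker_id":  ["w1", "w1", "w1", "w2", "w2", "w2"],
    "summary_id": ["s1", "s2", "s3", "s1", "s2", "s3"],
    "score":      [70, 85, 60, 40, 55, 65],
})

# Standardize each worker's raw scores to z-scores (mean 0, unit variance),
# then average the per-worker z-scores for each summary.
ratings["z"] = ratings.groupby("worker_id")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)
summary_z = ratings.groupby("summary_id")["z"].mean()
print(summary_z)
```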
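
The metric-human correlations reported by the notebooks amount to comparing a list of metric scores against the human z-scores with `scipy.stats`; the numbers below are made up purely for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# Made-up per-summary scores for illustration: an automatic metric
# (e.g. ROUGE) against the normalized human z-scores.
metric_scores = [0.31, 0.42, 0.27, 0.55, 0.48]
human_z       = [-0.20, 0.50, -0.60, 1.10, 0.70]

r, r_p = pearsonr(metric_scores, human_z)
rho, rho_p = spearmanr(metric_scores, human_z)
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```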
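
For the ZH/RU Latin-form conversion, the actual logic lives in `5. transform_RU_and_ZH.ipynb`. The snippet below only illustrates the idea using the `unidecode` package, which is our assumption and not necessarily what the notebook uses.

```python
# `unidecode` is one possible way to map Cyrillic/Chinese text to Latin
# characters; it is an illustrative choice, not necessarily the notebook's.
from unidecode import unidecode  # pip install unidecode

print(unidecode("Пример русского текста"))  # -> "Primer russkogo teksta"
print(unidecode("这是一个中文摘要"))          # -> rough pinyin-style Latin text
```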
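
Scores like those provided in `bert_score` can be produced with the `bert-score` package roughly as follows; the language and default model settings shown are illustrative, not necessarily the exact configuration used for the provided files. MoverScore is analogous but uses its own package.

```python
from bert_score import score  # pip install bert-score

# Toy candidate/reference pair; the precomputed outputs in `bert_score` were
# produced over the full summary files, possibly with different settings.
cands = ["the generated summary text"]
refs  = ["the reference summary text"]

P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(F1.mean().item())
```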