The code in this repo requires Python 3.10 or higher. We recommend creating a new conda environment as follows:
conda create -n guardians-mt-eval python=3.10
conda activate guardians-mt-eval
pip install --upgrade pip
pip install -e .
We trained the sentinel metrics using the Direct Assessments (DA) and Multidimensional Quality Metrics (MQM) human annotations downloaded from the COMET GitHub repository.
We trained the following sentinel metrics:
HF Model Name | Input | Training Data |
---|---|---|
sapienzanlp/sentinel-src-da | Source text | DA WMT17-20 |
sapienzanlp/sentinel-src-mqm | Source text | DA WMT17-20 + MQM WMT20-22 |
sapienzanlp/sentinel-cand-da | Candidate translation | DA WMT17-20 |
sapienzanlp/sentinel-cand-mqm | Candidate translation | DA WMT17-20 + MQM WMT20-22 |
sapienzanlp/sentinel-ref-da | Reference translation | DA WMT17-20 |
sapienzanlp/sentinel-ref-mqm | Reference translation | DA WMT17-20 + MQM WMT20-22 |
All metrics are based on XLM-RoBERTa large. All MQM sentinel metrics are further fine-tuned on MQM data starting from the DA-based sentinel metrics. All metrics can be found on 🤗 Hugging Face.
Except for `sentinel-metric-train`, all CLI commands included within this package require cloning and installing our fork of the Google WMT Metrics evaluation repository. To do this, execute the following commands:
git clone https://github.com/prosho-97/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
Then, download the WMT data following the instructions in the "Downloading the data" section of the mt-metrics-eval README.
You can use the `sentinel-metric-score` command to score translations with our metrics. For example, to use a SENTINELCAND metric:
echo -e 'Today, I consider myself the luckiest man on the face of the earth.\nI'"'"'m walking here! I'"'"'m walking here!' > sys1.txt
echo -e 'Today, I consider myself the lucky man\nI'"'"'m walking here.' > sys2.txt
sentinel-metric-score --sentinel-metric-model-name sapienzanlp/sentinel-cand-mqm -t sys1.txt sys2.txt
Output:
# input source sentences: 0 # input candidate translations: 4 # input reference translations: 0.
MT system: sys1.txt Segment idx: 0 Metric segment score: 0.4837.
MT system: sys2.txt Segment idx: 0 Metric segment score: 0.4722.
MT system: sys1.txt Segment idx: 1 Metric segment score: 0.0965.
MT system: sys2.txt Segment idx: 1 Metric segment score: 0.2735.
MT system: sys1.txt Metric system score: 0.2901.
MT system: sys2.txt Metric system score: 0.3729.
For a SENTINELSRC metric instead:
echo -e "本文件按照 GB/T 1.1 一 202 久标准化工作导则第工部分:标准化文件的结构和起草规则的规定起惠。\n增加了本文件适用对象(见第 1 章)," > src.txt
sentinel-metric-score --sentinel-metric-model-name sapienzanlp/sentinel-src-mqm -s src.txt
Output:
# input source sentences: 2 # input candidate translations: 0 # input reference translations: 0.
MT system: SOURCE Segment idx: 0 Metric segment score: 0.1376.
MT system: SOURCE Segment idx: 1 Metric segment score: 0.5106.
MT system: SOURCE Metric system score: 0.3241.
You can also score data samples from the test sets of the WMT Metrics Shared Tasks. For example:
sentinel-metric-score \
--sentinel-metric-model-name sapienzanlp/sentinel-cand-mqm \
--batch-size 128 \
--testset-name wmt23 \
--lp zh-en \
--ref-to-use refA \
--include-human \
--include-outliers \
--include-ref-to-use \
--only-system \
--out-path data/metrics_results/metrics_outputs/wmt23/zh-en/SENTINEL_CAND_MQM
Output:
lp = zh-en.
# segs = 1976.
# systems = 17.
# metrics = 0.
Std annotation type = mqm.
# refs = 2.
std ref = refA.
# MT systems to score in wmt23 for zh-en lp = 17.
No domain is specified.
# input source sentences: 1976 # input candidate translations: 33592 # input reference translations: 1976.
MT system: ONLINE-M Metric system score: -0.0576.
MT system: ONLINE-Y Metric system score: 0.0913.
MT system: NLLB_MBR_BLEU Metric system score: 0.0725.
MT system: Yishu Metric system score: 0.0703.
MT system: ONLINE-B Metric system score: 0.066.
MT system: ONLINE-G Metric system score: 0.0441.
MT system: refA Metric system score: -0.0193.
MT system: IOL_Research Metric system score: -0.0121.
MT system: ANVITA Metric system score: 0.0603.
MT system: HW-TSC Metric system score: 0.0866.
MT system: GPT4-5shot Metric system score: 0.1905.
MT system: ONLINE-W Metric system score: -0.0017.
MT system: synthetic_ref Metric system score: 0.0667.
MT system: Lan-BridgeMT Metric system score: 0.1543.
MT system: ONLINE-A Metric system score: 0.0231.
MT system: NLLB_Greedy Metric system score: 0.0836.
MT system: ZengHuiMT Metric system score: -0.0446.
`--out-path` points to the directory where the segment and system scores returned by the metric will be saved (`seg_scores.pickle` and `sys_scores.pickle`). Furthermore, you can provide the path to a model checkpoint using `--sentinel-metric-model-checkpoint-path` instead of specifying the Hugging Face model name with `--sentinel-metric-model-name`. Output scores can also be saved to a JSON file using the `--to-json` argument. Additionally, this command supports COMET metrics, which can be used with the `--comet-metric-model-name` or `--comet-metric-model-checkpoint-path` argument.
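Once saved, the scores can be inspected from Python. The following is a minimal sketch assuming only that the two files are standard pickle files; the exact structure of the stored objects is defined in sentinel_metric/cli/score.py:
import pickle
from pathlib import Path

# Directory passed to --out-path in the example above
out_dir = Path("data/metrics_results/metrics_outputs/wmt23/zh-en/SENTINEL_CAND_MQM")

with open(out_dir / "seg_scores.pickle", "rb") as f:
    seg_scores = pickle.load(f)
with open(out_dir / "sys_scores.pickle", "rb") as f:
    sys_scores = pickle.load(f)

# Inspect the loaded objects (their exact structure is defined in score.py)
print(type(seg_scores), type(sys_scores))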
For a complete description of the command (including scoring CSV data and limiting the evaluation to a specific WMT domain), you can use the `--help` argument:
sentinel-metric-score --help
The `sentinel-metric-compute-wmt23-ranking` command computes the WMT23 metrics ranking. For example, to compute the segment-level metrics ranking:
sentinel-metric-compute-wmt23-ranking \
--metrics-to-evaluate-info-filepath data/metrics_results/metrics_info/metrics_info_for_ranking.tsv \
--metrics-outputs-path data/metrics_results/metrics_outputs/wmt23 \
--k 0 \
--only-seg-level \
> data/metrics_results/metrics_rankings/seg_level_wmt23_final_ranking.txt
To group by item (Segment Grouping in the paper) when computing the segment-level Pearson correlation, use `--item-for-seg-level-pearson`. The output is located in data/metrics_results/metrics_rankings/item_group_seg_level_wmt23_final_ranking.txt. You can also limit the segment-level ranking to the Pearson correlation only, excluding the accuracy measure introduced by Deutsch et al. (2023), by using the `--only-pearson` flag. The output files will be located at data/metrics_results/metrics_rankings/only_pearson_seg_level_wmt23_final_ranking.txt and data/metrics_results/metrics_rankings/only_item_group_pearson_seg_level_wmt23_final_ranking.txt.
You can add other MT metrics to this comparison by creating new folders in data/metrics_results/metrics_outputs/wmt23 for each language pair, containing their segment-level and system-level scores (check how the `seg_scores.pickle` and `sys_scores.pickle` files are created in sentinel_metric/cli/score.py; a rough sketch is also given below). You also have to include their info in the data/metrics_results/metrics_info/metrics_info_for_ranking.tsv file, specifying the metric name, the name of the folder containing its scores, and which gold references were employed (or `src` if the metric is reference-free).
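As a rough illustration only, the snippet below writes the two pickle files for a hypothetical custom metric named MY_METRIC. It assumes the scores are stored as dictionaries mapping each MT system name to its segment-level scores and to its system-level score; verify the exact format against sentinel_metric/cli/score.py before relying on it:
import pickle
from pathlib import Path

# Hypothetical folder for a custom metric, one folder per language pair
out_dir = Path("data/metrics_results/metrics_outputs/wmt23/zh-en/MY_METRIC")
out_dir.mkdir(parents=True, exist_ok=True)

# Assumed structure: system name -> segment-level scores / system-level score
seg_scores = {"sys1": [0.51, 0.48], "sys2": [0.39, 0.44]}
sys_scores = {"sys1": 0.495, "sys2": 0.415}

with open(out_dir / "seg_scores.pickle", "wb") as f:
    pickle.dump(seg_scores, f)
with open(out_dir / "sys_scores.pickle", "wb") as f:
    pickle.dump(sys_scores, f)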
For a complete description of this command, execute:
sentinel-metric-compute-wmt23-ranking --help
The `sentinel-metric-compute-wmt-corrs` command computes the metrics rankings on WMT for all possible combinations of correlation function and grouping strategy in a given language pair. For example, for the zh-en language direction in WMT23, you can use the following command:
sentinel-metric-compute-wmt-corrs \
--metrics-to-evaluate-info-filepath data/metrics_results/metrics_info/metrics_info_for_wmt_corrs.tsv \
--testset-name wmt23 \
--lp zh-en \
--ref-to-use refA \
--primary-metrics \
--k 0 \
> data/metrics_results/wmt_corrs/wmt23/zh-en.txt
As with the previous command, you can include additional MT metrics by creating the necessary folders for the desired language pair and adding their info in the data/metrics_results/metrics_info/metrics_info_for_wmt_corrs.tsv file. For each new metric, you have to specify its name, whether it is reference-free, and the path to the folder containing its scores.
For a complete description of the command, execute:
sentinel-metric-compute-wmt-corrs --help
The `sentinel-metric-compute-corrs-matrix` command computes the correlation matrix for MT metrics in a given language pair, similar to the ones in the Appendix of our paper. To use it, two additional packages are required:
pip install matplotlib==3.9.1 seaborn==0.13.2
Then, considering the zh-en language direction in WMT23 as an example, you can execute the following command:
sentinel-metric-compute-corrs-matrix \
--metrics-to-evaluate-info-filepath data/metrics_results/metrics_info/metrics_info_for_corrs_matrix.tsv \
--testset-name wmt23 \
--lp zh-en \
--ref-to-use refA \
--out-file data/metrics_results/corr_matrices/wmt23/zh-en.pdf
To specify which MT metrics to include in the correlation matrix, you can edit data/metrics_results/metrics_info/metrics_info_for_corrs_matrix.tsv, specifying each metric's name, whether it is reference-free, and the path to its scores (`None` if the metric is already included in WMT); a small sketch for editing this file programmatically is given below.
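If you prefer to edit this file programmatically, here is a small sketch that reuses whatever header the shipped .tsv defines; the metric name, flag, and path in the appended row are purely hypothetical, and their order may need to be adapted to the actual columns:
import csv

tsv_path = "data/metrics_results/metrics_info/metrics_info_for_corrs_matrix.tsv"

# Read the existing rows and print the header to see which columns are expected
with open(tsv_path, newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(rows[0])

# Hypothetical entry: metric name, whether it is reference-free, path to its scores
# (None if the metric is already included in WMT); reorder to match the header above.
rows.append(["MY_METRIC", "True", "data/metrics_results/metrics_outputs/wmt23/zh-en/MY_METRIC"])

with open(tsv_path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)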
For a complete description of this command, execute:
sentinel-metric-compute-corrs-matrix --help
The `sentinel-metric-train` command trains a new sentinel metric:
sentinel-metric-train --cfg configs/models/sentinel_regression_metric_model.yaml --wandb-logger-entity WANDB_ENTITY
Edit the files in the configs directory to customize the training process. You can also start the training from a given model checkpoint (`--load-from-checkpoint`).
For a complete description of the command, execute:
sentinel-metric-train --help
You can also use the sentinel metrics directly from Python:
from sentinel_metric import download_model, load_from_checkpoint

# Download the metric from the Hugging Face Hub and load it from its checkpoint
model_path = download_model("sapienzanlp/sentinel-cand-mqm")
model = load_from_checkpoint(model_path)

# Candidate translations to score
data = [
    {"mt": "This is a candidate translation."},
    {"mt": "This is another candidate translation."},
]

output = model.predict(data, batch_size=8, gpus=1)
Output:
# Segment scores
>>> output.scores
[0.347846657037735, 0.22583423554897308]
# System score
>>> output.system_score
0.28684044629335403
This work has been published at ACL 2024 (Main Conference). If you use any part of this work, please consider citing our paper as follows:
@inproceedings{perrella-etal-2024-guardians,
title = "Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!",
author = "Perrella, Stefano and Proietti, Lorenzo and Scir{\`e}, Alessandro and Barba, Edoardo and Navigli, Roberto",
editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.856",
pages = "16216--16244",
}
This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.