TY - GEN
T1 - HUME: Human UCCA-Based Machine Translation Evaluation
T2 - 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016
AU - Birch, Alexandra
AU - Abend, Omri
AU - Bojar, Ondřej
AU - Haddow, Barry
N1 - Funding Information: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 644402 (HimL). Publisher Copyright: © 2016 Association for Computational Linguistics
PY - 2016
Y1 - 2016
AB - Human evaluation of machine translation normally uses sentence-level measures such as relative ranking or adequacy scales. However, these provide no insight into possible errors, and do not scale well with sentence length. We argue for a semantics-based evaluation, which captures what meaning components are retained in the MT output, thus providing a more fine-grained analysis of translation quality, and enabling the construction and tuning of semantics-based MT. We present a novel human semantic evaluation measure, Human UCCA-based MT Evaluation (HUME), building on the UCCA semantic representation scheme. HUME covers a wider range of semantic phenomena than previous methods and does not rely on semantic annotation of the potentially garbled MT output. We experiment with four language pairs, demonstrating HUME's broad applicability, and report good inter-annotator agreement rates and correlation with human adequacy scores.
UR - http://www.scopus.com/inward/record.url?scp=85040949754&partnerID=8YFLogxK
DO - 10.18653/v1/d16-1134
M3 - Conference contribution
T3 - EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 1264
EP - 1274
BT - EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings
PB - Association for Computational Linguistics (ACL)
Y2 - 1 November 2016 through 5 November 2016
ER -