TY - GEN
T1 - Statistical machine translation with automatic identification of translationese
AU - Twitto-Shmuel, Naama
AU - Ordan, Noam
AU - Wintner, Shuly
N1 - Publisher Copyright: © EMNLP 2015. All rights reserved.
PY - 2015
Y1 - 2015
AB - Translated texts (in any language) are so markedly different from original ones that text classification techniques can be used to tease them apart. Previous work has shown that awareness of these differences can significantly improve statistical machine translation. These results, however, required meta-information on the ontological status of texts (original or translated) which is typically unavailable. In this work we show that the predictions of translationese classifiers are as good as meta-information. First, when a monolingual corpus in the target language is given, to be used for constructing a language model, predicting the translated portions of the corpus, and using only them for the language model, is as good as using the entire corpus. Second, identifying the portions of a parallel corpus that are translated in the direction of the translation task, and using only them for the translation model, is as good as using the entire corpus. We present results from several language pairs and various data sets, indicating that these results are robust and general.
UR - http://www.scopus.com/inward/record.url?scp=85025160155&partnerID=8YFLogxK
M3 - Conference contribution
T3 - 10th Workshop on Statistical Machine Translation, WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Proceedings
SP - 47
EP - 57
BT - 10th Workshop on Statistical Machine Translation, WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Proceedings
A2 - Bojar, Ondrej
A2 - Chatterjee, Rajan
A2 - Federmann, Christian
A2 - Haddow, Barry
A2 - Hokamp, Chris
A2 - Huck, Matthias
A2 - Logacheva, Varvara
A2 - Pecina, Pavel
PB - Association for Computational Linguistics (ACL)
T2 - 10th Workshop on Statistical Machine Translation, WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015
Y2 - 17 September 2015 through 18 September 2015
ER -