TY - GEN
T1 - Challenging Language-Dependent Segmentation for Arabic
T2 - 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
AU - Sajjad, Hassan
AU - Dalvi, Fahim
AU - Durrani, Nadir
AU - Abdelali, Ahmed
AU - Belinkov, Yonatan
AU - Vogel, Stephan
N1 - Publisher Copyright: © 2017 Association for Computational Linguistics.
PY - 2017
Y1 - 2017
AB - Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, much research effort has been devoted to improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found that these methods achieve performance close to, and occasionally surpassing, the state of the art. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source to target tokens, and that a ratio close to or greater than 1 gives optimal performance.
UR - http://www.scopus.com/inward/record.url?scp=85040627430&partnerID=8YFLogxK
U2 - https://doi.org/10.18653/v1/P17-2095
DO - https://doi.org/10.18653/v1/P17-2095
M3 - Conference contribution
T3 - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
SP - 601
EP - 607
BT - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
Y2 - 30 July 2017 through 4 August 2017
ER -