TY - GEN
T1 - Identifying Code-switching in Arabizi
AU - Shehadi, Safaa
AU - Wintner, Shuly
N1 - Funding Information: We thank Melinda Fricke, Yulia Tsvetkov, Yuli Zeira, and the anonymous reviewers for their valuable feedback and suggestions. This work was supported in part by grant No. 2019785 from the United States-Israel Binational Science Foundation (BSF), and by grants No. 2007960, 2007656, 2125201 and 2040926 from the United States National Science Foundation (NSF). Publisher Copyright: © 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
UR - http://www.scopus.com/inward/record.url?scp=85152958715&partnerID=8YFLogxK
M3 - Conference contribution
T3 - WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop
SP - 194
EP - 204
BT - WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 7th Arabic Natural Language Processing Workshop, WANLP 2022 held with EMNLP 2022
Y2 - 8 December 2022
ER -