TY - GEN
T1 - The Denglisch Corpus of German-English Code-Switching
AU - Osmelak, Doreen
AU - Wintner, Shuly
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - When multilingual speakers engage in a conversation they inevitably introduce code-switching (CS), i.e., mixing of more than one language between and within utterances. CS is still an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some challenges in annotating CS, especially due to words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. Thereby, we created a large-scale corpus of German-English mixed utterances with precise indications of CS points.
UR - http://www.scopus.com/inward/record.url?scp=85174741044&partnerID=8YFLogxK
M3 - Conference contribution
T3 - SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop
SP - 42
EP - 51
BT - SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop
A2 - Beinborn, Lisa
A2 - Goswami, Koustava
A2 - Muradoglu, Saliha
A2 - Sorokin, Alexey
A2 - Kumar, Ritesh
A2 - Shcherbakov, Andreas
A2 - Ponti, Edoardo M.
A2 - Cotterell, Ryan
A2 - Vylomova, Ekaterina
T2 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Y2 - 6 May 2023
ER -