Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology?

Stav Klein, Reut Tsarfaty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces (WPs). In Morphologically-Rich Languages (MRLs) that exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused or intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes that make morphological knowledge explicit might boost performance for MRLs.
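
The "linear segmentation" the abstract refers to can be illustrated with a minimal sketch of greedy longest-match-first word-piece tokenization, the general scheme used by BERT-style tokenizers. This is not the authors' code: the function name `wordpiece_tokenize` and the toy vocabulary `TOY_VOCAB` are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first segmentation of one word into word-pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here: the whole word is mapped to unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only (not a real model's vocabulary).
TOY_VOCAB = {"liv", "##ing", "##e", "##s", "walk", "##ed"}

print(wordpiece_tokenize("living", TOY_VOCAB))
print(wordpiece_tokenize("walked", TOY_VOCAB))
```

Every piece produced this way is a contiguous substring of the word, read left to right. A unit of meaning that is fused with another or interleaved through the word, as in non-concatenative morphology, therefore cannot surface as a piece of its own under this kind of scheme.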

Original language: English
Title of host publication: SIGMORPHON 2020 - 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 204-209
Number of pages: 6
ISBN (Electronic): 9781952148194
State: Published - 2020
Event: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States
Duration: 10 Jul 2020 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics

Conference

Conference: 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Country/Territory: United States
City: Virtual, Online
Period: 10/07/20 → …

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
