Incorporating Context into Subword Vocabularies

Shaked Yehezkel, Yuval Pinter

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SAGE, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SAGE does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SAGE improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.

Original languageAmerican English
Title of host publicationEACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PublisherAssociation for Computational Linguistics (ACL)
Pages623-635
Number of pages13
ISBN (Electronic)9781959429449
StatePublished - 1 Jan 2023
Event17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 20236 May 2023

Publication series

NameEACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/TerritoryCroatia
CityDubrovnik
Period2/05/236/05/23

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Incorporating Context into Subword Vocabularies'. Together they form a unique fingerprint.

Cite this