An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a tokenization postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in model implementations, both as a means to reduce model size and to improve model performance through increased robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to consistently improve model performance, and is even prone to incurring heavy degradation.
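The trimming step the abstract describes can be illustrated with a minimal sketch: subwords whose corpus frequency falls below a threshold are dropped from the vocabulary and recursively replaced by the two components that merged to form them. All names and data below are illustrative assumptions, not taken from the paper or any specific library.

```python
# Hedged sketch of BPE vocabulary trimming (illustrative only):
# rare merged subwords are decomposed back into their components.

def trim_vocab(token_freq, merge_parents, threshold):
    """Return the trimmed vocabulary as a set: tokens at or above
    `threshold` survive; rarer merged tokens are recursively replaced
    by their component subwords (recorded in `merge_parents`)."""
    def decompose(tok):
        # Keep the token if it is frequent enough, or atomic
        # (i.e. it was never produced by a merge rule).
        if token_freq.get(tok, 0) >= threshold or tok not in merge_parents:
            return [tok]
        left, right = merge_parents[tok]
        return decompose(left) + decompose(right)

    trimmed = set()
    for tok in token_freq:
        trimmed.update(decompose(tok))
    return trimmed

# Example: "lowest" was built by merging ("low", "est"); because it is
# rare (frequency 2 < threshold 10), trimming removes it and keeps the
# more frequent components "low" and "est" instead.
merge_parents = {"lowest": ("low", "est"), "low": ("l", "ow")}
token_freq = {"lowest": 2, "low": 50, "est": 40, "l": 100, "ow": 90}
vocab = trim_vocab(token_freq, merge_parents, threshold=10)
```

Real tokenization libraries implement this as a frequency cutoff applied after BPE training; the sketch above only shows the decomposition idea the abstract refers to.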

Original language: American English
Title of host publication: Insights 2024 - 5th Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop
Editors: Shabnam Tafreshi, Arjun Reddy Akula, Joao Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
Publisher: Association for Computational Linguistics (ACL)
Pages: 48-50
Number of pages: 3
ISBN (Electronic): 9798891761025
State: Published - 1 Jan 2024
Event: 5th Workshop on Insights from Negative Results in NLP, Insights 2024 - Mexico City, Mexico
Duration: 20 Jun 2024 → …

Publication series

Name: Insights 2024 - 5th Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop

Conference

Conference: 5th Workshop on Insights from Negative Results in NLP, Insights 2024
Country/Territory: Mexico
City: Mexico City
Period: 20/06/24 → …

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
