Abstract
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. Although the removal of rare subwords is suggested as best practice in model implementations, both to reduce model size and to improve model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to consistently improve model performance and is even prone to incurring heavy degradation.
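The trimming step described above can be sketched in a few lines. The following is a hypothetical illustration, not the paper's or any library's actual implementation: each BPE subword below a frequency threshold is recursively replaced by the pair of subwords it was merged from (the `merges`, `freqs`, and `trim_vocabulary` names are assumptions for this sketch).

```python
# Hypothetical sketch of threshold vocabulary trimming for BPE.
# merges: maps a merged subword to the (left, right) pair it was built from.
# freqs: corpus frequency of each subword.
# Subwords rarer than `threshold` are decomposed back into their parts.

def trim_vocabulary(merges, freqs, threshold):
    # Only subwords that have a recorded merge can be decomposed;
    # base symbols stay even if rare.
    trimmed = {tok for tok in freqs if freqs[tok] < threshold and tok in merges}

    def decompose(token):
        # Recursively expand trimmed subwords into their components.
        if token not in trimmed:
            return [token]
        left, right = merges[token]
        return decompose(left) + decompose(right)

    return decompose

# Toy example: suppose "est" is rare and gets trimmed, "low" is kept.
merges = {"est": ("es", "t"), "low": ("lo", "w")}
freqs = {"low": 120, "est": 3, "es": 40, "t": 500, "lo": 80, "w": 300}
decompose = trim_vocabulary(merges, freqs, threshold=10)
print([piece for tok in ["low", "est"] for piece in decompose(tok)])
# → ['low', 'es', 't']
```

In real toolkits (e.g. subword-nmt's `--vocabulary-threshold` option) the same idea is applied at encoding time, so the model's effective vocabulary shrinks without retraining the merges.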
| Original language | American English |
|---|---|
| Title of host publication | Insights 2024 - 5th Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop |
| Editors | Shabnam Tafreshi, Arjun Reddy Akula, Joao Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 48-50 |
| Number of pages | 3 |
| ISBN (Electronic) | 9798891761025 |
| State | Published - 1 Jan 2024 |
| Event | 5th Workshop on Insights from Negative Results in NLP, Insights 2024 - Mexico City, Mexico. Duration: 20 Jun 2024 → … |
Publication series
| Name | Insights 2024 - 5th Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop |
|---|---|
Conference
| Conference | 5th Workshop on Insights from Negative Results in NLP, Insights 2024 |
|---|---|
| Country/Territory | Mexico |
| City | Mexico City |
| Period | 20/06/24 → … |
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Computational Theory and Mathematics
- Computer Science Applications
- Software