Abstract
Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLMs), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We find that LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenization scheme powering many popular LLMs. Unlike binary pronouns, neopronouns are overfragmented by BPE, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, which opens up new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.
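The fragmentation effect described in the abstract can be illustrated with a toy BPE trainer. This is a minimal sketch, not the paper's implementation: the word counts in `toy_freqs` are invented for illustration, the trainer works on characters rather than bytes, and the `whole_tokens` argument is a hypothetical stand-in for the idea of tokenization parity (the abstract does not describe how parity is actually enforced).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of BPE merges from a {word: count} dict (toy, character-level)."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def tokenize(word, merges, whole_tokens=frozenset()):
    """Apply learned merges in order; words in `whole_tokens` bypass BPE entirely.

    `whole_tokens` is an assumption of this sketch, used only to picture
    what "parity" with binary pronouns would look like.
    """
    if word in whole_tokens:
        return [word]
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Invented counts: binary pronouns are frequent, neopronouns are rare.
toy_freqs = {"he": 1200, "she": 1000, "her": 900, "his": 850, "him": 800,
             "xe": 2, "xem": 1, "xir": 1}
merges = train_bpe(toy_freqs, num_merges=6)

print(tokenize("she", merges))                        # ['she']
print(tokenize("xir", merges))                        # ['x', 'i', 'r']
print(tokenize("xir", merges, whole_tokens={"xir"}))  # ['xir']
```

With a small merge budget, every merge is spent on the frequent binary pronouns, so each of them becomes a single token while the rare neopronouns stay split into characters. Registering the neopronouns as whole tokens removes that disparity in this toy setting.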
| Original language | American English |
|---|---|
| Title of host publication | Findings of the Association for Computational Linguistics |
| Subtitle of host publication | NAACL 2024 - Findings |
| Editors | Kevin Duh, Helena Gomez, Steven Bethard |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 1739-1756 |
| Number of pages | 18 |
| ISBN (Electronic) | 9798891761193 |
| DOIs | |
| State | Published - 1 Jan 2024 |
| Event | 2024 Findings of the Association for Computational Linguistics: NAACL 2024 - Mexico City, Mexico; Duration: 16 Jun 2024 → 21 Jun 2024 |
Publication series
| Name | Findings of the Association for Computational Linguistics: NAACL 2024 - Findings |
|---|---|
Conference
| Conference | 2024 Findings of the Association for Computational Linguistics: NAACL 2024 |
|---|---|
| Country/Territory | Mexico |
| City | Mexico City |
| Period | 16/06/24 → 21/06/24 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 5 Gender Equality
All Science Journal Classification (ASJC) codes
- Computational Theory and Mathematics
- Software
Title: Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies