TY - JOUR
T1 - Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification
T2 - A Data-Centric Hypothesis-Testing Approach
AU - Yoffe, Gideon
AU - Dershowitz, Nachum
AU - Vishne, Ariel
AU - Sober, Barak
N1 - Publisher Copyright: © 2025 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
PY - 2025
Y1 - 2025
N2 - We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties, such as thematic continuity, on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework to a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, showing that supervised and neural models are more prone to false positives, mistaking shared themes or cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts. Controlling for sequential correlation is essential for reducing false positives and ensuring classification outcomes reflect genuine stylistic distinctions.
UR - http://www.scopus.com/inward/record.url?scp=105005401665&partnerID=8YFLogxK
U2 - 10.1080/09296174.2025.2496172
DO - 10.1080/09296174.2025.2496172
M3 - Article
SN - 0929-6174
JO - Journal of Quantitative Linguistics
JF - Journal of Quantitative Linguistics
ER -