Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach

Gideon Yoffe, Nachum Dershowitz, Ariel Vishne, Barak Sober

Research output: Contribution to journal, Article, peer-review

Abstract

We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties, such as thematic continuity, on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework to a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, showing that supervised and neural models are more prone to false positives, mistaking shared themes or cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess classification reliability. This is particularly relevant to authorship attribution, forensic linguistics, and the analysis of redacted or composite texts. Controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.
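
The abstract does not spell out how the surrogate labelings are drawn, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes binary labels, estimates the empirical autocovariance of the label sequence, builds the corresponding Toeplitz covariance matrix, samples correlated Gaussian vectors from it, and thresholds each draw so that the class proportions match the observed labeling. Thresholding preserves the autocovariance only approximately. The function names (`empirical_autocovariance`, `surrogate_labelings`, `surrogate_p_value`) are hypothetical.

```python
import numpy as np


def empirical_autocovariance(labels):
    """Biased autocovariance estimate at lags 0..n-1 of a label sequence."""
    x = np.asarray(labels, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Dividing by n (not n - k) keeps the Toeplitz matrix positive semi-definite.
    return np.array([np.dot(x[: n - k], x[k:]) / n for k in range(n)])


def surrogate_labelings(labels, n_surrogates, rng):
    """Draw binary surrogate labelings with a similar sequential structure.

    Assumption: correlated Gaussian draws are thresholded so that every
    surrogate has the same number of 1-labels as the observed sequence.
    """
    labels = np.asarray(labels)
    n = len(labels)
    acov = empirical_autocovariance(labels)
    # Toeplitz covariance matrix: entry (i, j) is the autocovariance at lag |i - j|.
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = acov[lags]
    # Matrix square root via eigendecomposition; clip tiny negative eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    root = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    draws = rng.standard_normal((n_surrogates, n)) @ root.T
    # Label the k largest values in each draw as class 1.
    k = int(labels.sum())
    ranks = draws.argsort(axis=1).argsort(axis=1)
    return (ranks >= n - k).astype(int)


def surrogate_p_value(observed_score, surrogate_scores):
    """One-sided p-value: share of surrogates scoring at least as high."""
    surrogate_scores = np.asarray(surrogate_scores, dtype=float)
    exceed = np.sum(surrogate_scores >= observed_score)
    return (1 + exceed) / (1 + len(surrogate_scores))
```

In use, one would compute a classification-quality score for the real labeling, recompute it for each surrogate row returned by `surrogate_labelings(labels, 1000, np.random.default_rng(0))`, and pass both to `surrogate_p_value`. A small p-value would then suggest that the observed separation is unlikely to arise from sequential correlation alone.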

Original language: English
Journal: Journal of Quantitative Linguistics
State: Accepted/In press - 2025

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
