
Uncovering Measurement Biases in LLM Embedding Spaces: The Anna Karenina Principle and Its Implications for Automated Feedback

Abigail Gurin Schleifer, Beata Beigman Klebanov, Giora Alexandron

Research output: Contribution to journal › Article › peer-review

Abstract

Large Language Models (LLMs) are increasingly used in assessment systems to analyze student responses to open-ended questions and provide personalized feedback on them. However, the quality of diagnosis provided by such systems depends heavily on the ability of the LLMs to accurately capture the subtle differences between responses that represent the key types of student reasoning, also referred to as Knowledge Profiles (KPs). In this study, we compared expert-defined KPs with data-driven clusters generated from LLM embeddings of student responses in biology. We aimed to determine whether LLM-based clusters align with the theory-driven KPs that classify responses by their level of conceptual accuracy. Our findings revealed a ‘discoverability bias’: LLM-derived clusters captured the high-quality responses reasonably well but failed to distinguish between the different ways in which student responses can be incorrect. We then traced this bias to the representations of the KPs in the pre-trained LLM embedding space and showed that as student responses become less correct, they become less similar in the embedding space to other responses that reveal the same type of conceptual error. Furthermore, we found a strong relationship between the quality of the KP responses (correct or various degrees of incorrect) and the shape and density of their embedding-based representation. Specifically, the lower the quality of the KP, the less similar its responses are to each other in the embedding space. This phenomenon, which we call the ‘Anna Karenina Principle’ and study in the context of automated short answer scoring, suggests that LLM embeddings may not be sufficiently sensitive out-of-the-box to the nuances that distinguish between key profiles of conceptual understanding. This limitation poses challenges for developing fair and effective LLM-based formative assessment systems.
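
The within-profile similarity described in the abstract can be illustrated with a minimal sketch. This is not the authors' method or code: the abstract does not state which similarity measure was used, so cosine similarity is an assumption here, the function name `mean_within_profile_similarity` is hypothetical, and the two-dimensional vectors are made-up toy values rather than data from the study.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_within_profile_similarity(embeddings, profiles):
    """Average pairwise cosine similarity among responses sharing a profile.

    `embeddings` is a list of vectors and `profiles` a parallel list of
    Knowledge Profile labels. Profiles with fewer than two responses are
    skipped because they have no pairs to compare.
    """
    groups = defaultdict(list)
    for vec, kp in zip(embeddings, profiles):
        groups[kp].append(vec)

    result = {}
    for kp, vecs in groups.items():
        pairs = list(combinations(vecs, 2))
        if pairs:
            result[kp] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return result


# Toy vectors (assumption, not study data): the two "correct" responses
# coincide, while the two "incorrect" responses point in orthogonal directions.
embeddings = [
    [1.0, 0.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
]
profiles = ["correct", "correct", "incorrect", "incorrect"]

cohesion = mean_within_profile_similarity(embeddings, profiles)
print(cohesion)
```

In this toy setup the "correct" profile is more cohesive than the "incorrect" one, which is the pattern the abstract reports for lower-quality KPs. With real data, the vectors would come from whatever embedding model is under study.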

Original language: English
Journal: International Journal of Artificial Intelligence in Education
State: Published Online - 30 May 2025

All Science Journal Classification (ASJC) codes

  • Education
  • Computational Theory and Mathematics
