TY - GEN
T1 - Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments
AU - Strugatski, Alona
AU - Alexandron, Giora
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/3/3
Y1 - 2025/3/3
AB - Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions (MCQs) in assessments, the detection of AI cheating in MCQ-based tests has remained almost unexplored, in contrast to the focus on detecting AI cheating in text-rich student outputs. In this paper, we propose a method based on the application of Item Response Theory (IRT) to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics (PFS). We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), and that it is sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
UR - http://www.scopus.com/inward/record.url?scp=105000303878&partnerID=8YFLogxK
U2 - 10.1145/3706468.3706490
DO - 10.1145/3706468.3706490
M3 - Conference contribution
T3 - 15th International Conference on Learning Analytics and Knowledge, LAK 2025
SP - 817
EP - 823
BT - 15th International Conference on Learning Analytics and Knowledge, LAK 2025
T2 - 15th International Conference on Learning Analytics and Knowledge, LAK 2025
Y2 - 3 March 2025 through 7 March 2025
ER -