Abstract
This study evaluates large language models (LLMs) in comparison with experienced medical teachers. The analysis examines the performance of three prominent LLMs: ChatGPT, Gemini, and Copilot. Fleiss’ Kappa test is used to statistically analyze the concordance between LLM and human responses. In cases of discordance, Cohen’s Kappa test was used to assess agreement between the three generative AI tools and a medical teacher. The results reveal a significant difference in performance between the LLMs and medical teachers, highlighting potential limitations of using AI alone for medical education.
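The two agreement statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `fleiss_kappa` and `cohen_kappa` and the toy answer labels are hypothetical, and the sketch only applies the standard textbook definitions of the two coefficients.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for several raters.

    ratings: list of items, each a list with one label per rater
    (every item must have the same number of raters, at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing rater pairs per item, averaged.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each label.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:  # only one label was ever used: kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (invented for illustration, not taken from the study):
# answers chosen by four hypothetical raters on three multiple-choice items.
toy_items = [["A", "A", "A", "B"], ["C", "C", "C", "C"], ["B", "D", "B", "B"]]
overall_agreement = fleiss_kappa(toy_items)

# Pairwise comparison of one hypothetical tool against one hypothetical teacher.
tool_answers = ["A", "C", "B", "D"]
teacher_answers = ["A", "C", "D", "D"]
pairwise_agreement = cohen_kappa(tool_answers, teacher_answers)
```

Both functions return 1 for perfect agreement and values near 0 when agreement is no better than chance. Fleiss' kappa summarizes all raters at once, whereas Cohen's kappa compares one pair of raters at a time.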
| Original language | American English |
|---|---|
| Article number | 443 |
| Journal | BMC Medical Education |
| Volume | 25 |
| Issue number | 1 |
| DOIs | |
| State | Published - 1 Dec 2025 |
Keywords
- Generative AI
- LLM
- Machine learning
- Medical education
All Science Journal Classification (ASJC) codes
- Education
Title
Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher