Abstract
Background: Existing tools for reference retrieval using large language models (LLMs) frequently generate inaccurate, gray-literature, or fabricated citations, leading to poor accuracy. In this study, we address this gap by developing a highly accurate reference retrieval system focused on the precision and reliability of citations across five medical fields. Methods: We developed LITERAS, an open-source system of multi-AI literature review and citation retrieval agents designed to generate literature review drafts with accurate and confirmable citations. LITERAS integrates search of the largest biomedical literature database (MEDLINE) via PubMed's application programming interface with bidirectional inter-agent communication to enhance citation accuracy and reliability. To evaluate its performance, we compared LITERAS against two state-of-the-art LLMs from Perplexity AI, Sonar and Sonar-Pro. The evaluation covered five medical disciplines: Oncology, Cardiology, Rheumatology, Psychiatry, and Infectious Diseases/Public Health, focusing on the credibility, precision, and confirmability of citations, as well as the overall quality of the referenced sources. Results: LITERAS achieved near-perfect citation accuracy (i.e., whether references match real publications) at 99.82 %, statistically indistinguishable from Sonar (100.00 %, p = 0.065) and Sonar-Pro (99.93 %, p = 0.074). For referencing accuracy (the consistency between in-text citation details and source metadata), LITERAS (96.81 %) significantly outperformed Sonar (89.07 %, p < 0.001) and matched Sonar-Pro (96.33 %, p = 0.139). Notably, LITERAS relied exclusively on Q1–Q2 peer-reviewed journals (0 % nonacademic content), whereas Sonar drew 35.60 % of its sources from nonacademic content (p < 0.01) and Sonar-Pro 6.47 % (p < 0.001). However, Sonar-Pro cited higher-impact journals than LITERAS (median impact factor (IF) 14.70 vs 3.70, p < 0.001).
LITERAS's multi-agent loop (2.2 ± 1.34 iterations per query) minimized hallucinations and consistently prioritized recent articles (IQR = 2023–2024). Field-specific analysis showed the largest IF discrepancy in Oncology (Sonar-Pro 42.1 vs LITERAS 4.3, p < 0.001), reflecting Sonar-Pro's preference for major consortium guidelines and high-impact meta-analyses. Conclusion: LITERAS retrieved significantly more recent academic journal articles and generated longer summary reports than academic-search LLM approaches in literature review tasks. This work provides insights into improving the reliability of AI-assisted literature review systems.
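The abstract states that LITERAS searches MEDLINE through PubMed's application programming interface. As an illustration only (the paper's actual query logic is not described here), a PubMed search via NCBI's E-utilities ESearch endpoint might be constructed as below; the function name, example query, and parameter choices are assumptions:

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint, which queries PubMed/MEDLINE.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_search_url(query: str, retmax: int = 20) -> str:
    """Build an ESearch URL that returns matching PubMed IDs (PMIDs) as JSON."""
    params = {
        "db": "pubmed",      # search the PubMed/MEDLINE database
        "term": query,       # free-text or fielded query string
        "retmax": retmax,    # maximum number of PMIDs to return
        "retmode": "json",   # machine-readable response format
        "sort": "pub_date",  # favor recent articles, as LITERAS reportedly does
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

url = build_pubmed_search_url("checkpoint inhibitors melanoma", retmax=5)
```

Fetching this URL (e.g., with `urllib.request`) would return PMIDs whose full metadata can then be retrieved with the companion EFetch endpoint and compared against drafted citations.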
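The reported multi-agent loop (2.2 ± 1.34 iterations per query) suggests an iterative draft-and-verify pattern between agents. A minimal sketch of such a writer–verifier loop follows; the agent interfaces, names, and toy data are assumptions for illustration, not the paper's implementation:

```python
def review_loop(writer, verifier, query, max_iters=5):
    """Alternate drafting and verification until no problems remain."""
    draft = writer(query, draft=None, problems=None)
    for iteration in range(1, max_iters + 1):
        problems = verifier(draft)      # e.g., citations that do not match
                                        # any retrieved MEDLINE record
        if not problems:
            return draft, iteration     # converged: draft fully verified
        draft = writer(query, draft=draft, problems=problems)
    return draft, max_iters

# Toy agents to make the sketch runnable: the verifier flags citations
# absent from a mock MEDLINE index; the writer replaces flagged ones.
medline = {"PMID:1", "PMID:2"}

def toy_writer(query, draft=None, problems=None):
    if draft is None:
        return ["PMID:1", "PMID:999"]   # first draft contains a bad citation
    return [("PMID:2" if c in problems else c) for c in draft]

def toy_verifier(draft):
    return [c for c in draft if c not in medline]

final_draft, iters = review_loop(toy_writer, toy_verifier, "immunotherapy")
# converges once every remaining citation is verifiable
```

The loop terminates either when the verifier raises no objections or when the iteration cap is reached, which bounds cost per query while letting most queries converge in a few rounds.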
| Original language | English |
|---|---|
| Article number | 110363 |
| Journal | Computers in Biology and Medicine |
| Volume | 192 |
| DOIs | |
| State | Published - Jun 2025 |
Keywords
- Agents
- Artificial intelligence
- Citations
- Large language models
- Literature review
- Multi AI agents
All Science Journal Classification (ASJC) codes
- Health Informatics
- Computer Science Applications