Dear Editor,
I read with great interest the article titled “Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition” published in Diagnostic and Interventional Radiology.1 The study examines how large language models (LLMs) respond to multiple-choice and image-based questions derived from the Breast Imaging Reporting and Data System (BI-RADS) Atlas 5th edition and presents the impressive results achieved by these models. Research of this kind is crucial to understanding the growing potential role of artificial intelligence technologies, particularly LLMs, in radiology decision-making. As a contribution to the valuable findings of this study, I believe that the retrieval-augmented generation (RAG) approach merits consideration, as it combines information retrieval and text generation more effectively in such scenarios.
Retrieval-augmented generation enables language models to address existing knowledge gaps by accessing external information sources, allowing them to generate more accurate, up-to-date, and contextually appropriate text.2 It consists of two main components: retrieval and generation. In the retrieval phase, the query is converted into a vector representation (a text embedding), for example using an OpenAI embedding model. This vector is then compared with pre-indexed document embeddings using a similarity search to retrieve the most relevant passages (top-k retrieval). In the generation phase, the retrieved passages are added to the LLM's input, and the model generates its answer conditioned on this context.3,4 This method holds strong potential, especially in fields that require complex information processing, such as radiology and detailed analyses based on BI-RADS, as the sketch below illustrates.
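To make the two phases concrete, the following minimal sketch shows a simple RAG loop in Python using the OpenAI SDK. The embedding model, chat model, and example passages are illustrative assumptions on my part, not the setup used in the study under discussion or in the references cited here.

```python
import numpy as np
from openai import OpenAI  # assumes the OpenAI Python SDK (>=1.0) and an API key are configured

client = OpenAI()

# A small, illustrative "knowledge base", e.g. passages paraphrased from a BI-RADS-style guideline.
documents = [
    "Category 4 findings are subdivided into 4A (low suspicion), 4B (moderate suspicion), and 4C (high suspicion).",
    "Category 3 indicates a probably benign finding with a low likelihood of malignancy.",
    "An irregular, spiculated mass on mammography is a finding suspicious for malignancy.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Retrieval phase, step 1: convert texts into embedding vectors."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Index the documents once.
doc_vectors = embed(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval phase, step 2: rank documents by cosine similarity and keep the top k."""
    query_vector = embed([query])[0]
    similarities = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    top_k = np.argsort(similarities)[::-1][:k]
    return [documents[i] for i in top_k]

def answer(query: str) -> str:
    """Generation phase: prepend the retrieved passages so the model answers from this context."""
    context = "\n".join(retrieve(query))
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content

print(answer("Which category corresponds to a probably benign finding?"))
```

In practice, the pre-indexed documents would be stored in a vector database rather than an in-memory array, but the division of labor is the same: similarity search supplies the relevant guideline text, and the LLM only phrases the answer.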
In a study highlighting the effectiveness of this method in radiology, Tozuka et al.5 performed tumor-node-metastasis staging of lung cancer using LLMs with and without RAG. In that study, Google’s NotebookLM, a system incorporating RAG, achieved the highest staging performance. GPT-4o was also tested with and without RAG, and the use of RAG yielded better results across all staging categories.5 Given the limitations of current static models, such as knowledge gaps and the risk of generating misleading content (hallucinations), RAG offers a promising approach to mitigating these issues and providing a more practical solution in both radiology education and clinical practice.
I would like to express my gratitude once again for this study’s contribution to the field. I believe that incorporating RAG into future research, particularly studies evaluating the knowledge of LLMs on radiology guidelines, as in the present case, could further enhance model accuracy and reliability.