Dear Editor,
We read with great interest the article by Bayar-Kapıcı et al.1 regarding the diagnostic sensitivity of ChatGPT for detecting hemorrhages on cranial computed tomography (CT) scans. This study provides valuable insights into how large language models (LLMs) might support radiologists in acute clinical settings. As researchers focused on the intersection of artificial intelligence and radiology, we appreciate this contribution to the literature. Studies exploring the diagnostic boundaries of these systems are essential for understanding their potential and limitations. Recent studies indicate that multimodal LLMs demonstrate more limited performance in image interpretation than in text-based reasoning.2-4 The findings of Bayar-Kapıcı et al.1 reflect these diagnostic limitations within the context of cranial CT scans.
To facilitate a more nuanced understanding of the results and to support the reproducibility of this research, we believe certain technical and procedural aspects warrant further clarification. Given that the figures in the article appear to reflect the web-based version of ChatGPT, we were curious whether a new chat session was started for each of the 216 images. If multiple cases were analyzed within a single session, the model may have been influenced by prior contextual information, which could, in turn, affect diagnostic accuracy on subsequent questions.
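For illustration only, a minimal sketch of how such session independence could be enforced is shown below. It assumes the OpenAI Python SDK; the model name, prompt, and file paths are hypothetical placeholders, not the authors' actual protocol.

```python
# Minimal sketch: one independent API request per image, so no prior case
# can leak into the model's context. Assumes the OpenAI Python SDK; the
# model name, prompt, and paths below are illustrative placeholders.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Is there an intracranial hemorrhage on this cranial CT slice?"

for image_path in sorted(Path("ct_images").glob("*.png")):
    encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    # A fresh `messages` list per request is the API equivalent of
    # opening a new chat session for every case.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the exact version should be reported
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            ],
        }],
    )
    print(image_path.name, response.choices[0].message.content)
```

Because each request carries only its own image, no earlier case can influence the answer; in the web interface, the equivalent safeguard is opening a new chat per image.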
Furthermore, just as specific window settings and slice thickness are essential for radiologists' clinical interpretation, these parameters may substantially influence how LLMs process and identify findings in radiological images. Information regarding these technical parameters would be highly beneficial for future comparative studies.
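To illustrate how strongly these parameters shape the model's input, the sketch below renders the same DICOM slice under two conventional window settings (assuming pydicom, NumPy, and Pillow; the window values and file names are illustrative, not drawn from the study).

```python
# Minimal sketch: CT windowing determines which Hounsfield-unit range is
# visible in the exported image an LLM would receive. Assumes pydicom,
# NumPy, and Pillow; window values and file names are illustrative.
import numpy as np
import pydicom
from PIL import Image

def window_to_png(dicom_path, png_path, center=40.0, width=80.0):
    ds = pydicom.dcmread(dicom_path)
    # Convert stored pixel values to Hounsfield units.
    hu = ds.pixel_array * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
    lo, hi = center - width / 2, center + width / 2
    # Clip to the chosen window and rescale to 8-bit grayscale.
    img = np.clip(hu, lo, hi)
    img = ((img - lo) / (hi - lo) * 255.0).astype(np.uint8)
    Image.fromarray(img).save(png_path)

window_to_png("slice.dcm", "brain_window.png")                  # WL 40 / WW 80
window_to_png("slice.dcm", "subdural_window.png", 75.0, 215.0)  # WL 75 / WW 215
```

The two exports contain different subsets of the slice's dynamic range, which is why reporting the window settings is as important for LLM studies as it is for human readers.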
Regarding the model version, although the study refers to ChatGPT-4V, the research timeline from March to May 2025 suggests the potential use of more recent iterations. Considering that GPT-4o was introduced in May 2024 and GPT-4.5 in February 2025, it is possible that one of these later versions was the model actually used during the data collection period.5 Clarifying the specific version would provide essential context for future benchmarking and would be particularly helpful for meta-analyses evaluating the evolution of model performance in neuroimaging.
Finally, knowing the image format and whether the files were obtained via direct screenshots or through software-based conversion from DICOM to JPEG or PNG would be valuable for ensuring the reproducibility of future studies. Different image formats and compression methods can substantially affect image quality and data integrity.6 These technical variations may influence the diagnostic performance of LLMs, particularly when identifying subtle radiological findings.
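To make this concrete, the sketch below (assuming Pillow and NumPy; file names and the JPEG quality setting are illustrative) contrasts a lossless PNG export with a lossy JPEG round-trip and quantifies the pixel-level detail discarded by compression.

```python
# Minimal sketch: PNG is lossless, whereas JPEG discards detail; the
# quality setting and file names here are illustrative placeholders.
import numpy as np
from PIL import Image

img = Image.open("brain_window.png").convert("L")
img.save("case_lossless.png")            # PNG round-trips bit-exactly
img.save("case_lossy.jpg", quality=75)   # JPEG introduces compression artifacts

# Quantify what compression discards: any nonzero deviation is lost detail.
a = np.asarray(img, dtype=np.int16)
b = np.asarray(Image.open("case_lossy.jpg").convert("L"), dtype=np.int16)
print("max pixel deviation after JPEG round-trip:", np.abs(a - b).max())
```

Reporting the format, compression settings, and conversion pipeline would therefore allow future studies to reproduce the exact inputs presented to the model.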
In conclusion, the finding that diagnostic accuracy improves significantly with guided prompts is particularly encouraging. It underscores the potential of LLMs as supportive tools when embedded within supervised diagnostic systems. We believe these clarifications would further enhance the impact of this valuable study, and we thank the authors for their inspiring work.


