Automated evaluation of pulmonary lesion changes on chest radiograph during follow-up using semantic segmentation
Artificial Intelligence And Informatics - Original Article

Diagn Interv Radiol . Published online 1 December 2025.
1. University of Ulsan Faculty of Medicine, Department of Biomedical Engineering, AMIST, Asan Medical Center, Seoul, Republic of Korea
2. University of Ulsan Faculty of Medicine, Department of Convergence Medicine, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, Seoul, Republic of Korea
3. University of Ulsan Faculty of Medicine, Department of Radiology and Research Institute of Radiology, Asan Medical Center, Seoul, Republic of Korea
4. University of Ulsan Faculty of Medicine, Health Screening and Promotion Center, Asan Medical Center, Seoul, Republic of Korea
5. Bigdata Research Center, Asan Institute for Life Science, Asan Medical Center, Seoul, Republic of Korea
Received Date: 22.07.2025
Accepted Date: 23.09.2025
E-Pub Date: 01.12.2025

ABSTRACT

PURPOSE

To develop and validate a deep learning-based model utilizing lesion-specific segmentation to determine the changed/unchanged status of consolidation and pleural effusion in paired chest radiographs (CRs).

METHODS

The model was trained using 5,178 CRs from a single institution for lesion segmentation. Paired CRs from the emergency department (ED) and intensive care unit (ICU) were used to determine the thresholds for change and for temporal validation. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), and its accuracy was compared with that of a thoracic radiologist.

RESULTS

In the ED, the model achieved AUCs of 0.988 and 0.883 for consolidation and pleural effusion, respectively, with accuracies of 0.900 (36/40) and 0.825 (33/40). The radiologist showed accuracies of 0.975 (39/40) and 0.950 (38/40), respectively. In the ICU, model AUCs were 0.970 (consolidation) and 0.955 (pleural effusion), with accuracies of 0.875 (35/40) and 0.800 (32/40), respectively. Radiologist performance was 0.975 (39/40) for consolidation and 1.000 (40/40) for pleural effusion. Accuracy did not differ significantly between the model and the radiologist for consolidation in the ICU or for either target in the ED (all P > 0.05); for pleural effusion in the ICU, the radiologist was more accurate (P = 0.01).

CONCLUSION

The lesion-specific deep learning model was feasible for identifying interval changes in consolidation and pleural effusion on follow-up CRs.

CLINICAL SIGNIFICANCE

It could potentially be utilized for prioritizing interpretation, generating alerts, and extracting time-series data from multiple follow-up CRs.

Keywords:
Radiography, thoracic, follow-up studies, diagnosis, computer-assisted, artificial intelligence, segmentation

Main points

• Using lesion-specific segmentation, a deep-learning model determines consolidation and pleural effusion changes in chest radiographs (CRs) by assessing changes in their extent.

• A deep-learning model achieved an area under the curve of 0.970–0.988 for determining the changed/unchanged status of consolidation and 0.883–0.955 for pleural effusion in follow-up CRs from emergency department and intensive care unit datasets.

• With a predefined threshold, the model demonstrated an accuracy of 0.875–0.900 for changed/unchanged determination in consolidation and 0.800–0.825 for pleural effusion.

Chest radiography is a widely used medical imaging modality due to its cost-effectiveness and low radiation exposure. Chest radiographs (CRs) detect thoracic abnormalities and track changes during follow-ups. Monitoring abnormalities such as pleural effusion or consolidation is crucial for evaluating disease progression and treatment response.1-4 However, frequent follow-up CRs increase workload. For example, in intensive care units (ICUs), CR is often performed daily for patients who are critically ill or after device adjustments, generating millions of ICU CRs annually in the United States.5, 6 Consequently, the timely and accurate interpretation of follow-up CRs is becoming more challenging.

Since follow-up CRs primarily serve to detect changes between exams, analyzing CR pairs, rather than relying solely on single-image abnormality detection, is necessary. One line of recently developed deep learning methods detects overall changes using image registration to identify all CR findings.7, 8 It operates independently of detectable abnormality types and lesion-specific segmentation performance. However, it lacks information on which lesions have changed and the nature of these changes, which is essential in clinical practice. Furthermore, in settings such as the ICU, where various medical devices are attached, even simple repositioning, addition, or removal of a device may be recorded as a change, making it difficult to determine accurately whether a true change has occurred in the finding of interest.

Some methods have targeted specific abnormalities. For example, Li et al.9 compared lung infiltration on serial CRs of patients with Coronavirus Disease 2019 (COVID-19), Huang et al.10 quantified pleural effusion severity on individual CRs, and Lim et al.11 estimated lung nodule volume from serial CRs. Although these studies demonstrated the feasibility or potential applicability of abnormality-specific monitoring, their scope was restricted to a single lesion type. An alternative approach that enables the simultaneous tracking of different abnormalities is lesion segmentation. Singh et al.12 developed a deep learning algorithm that segments specific abnormalities and determines their changed/unchanged status based on the persistence of segmentation masks for lesions. The study reported an area under the receiver operating characteristic curve (AUC) of 0.758 for evaluating changes in pulmonary opacities over follow-up CRs. However, the algorithm could not register a change when a lesion persisted but varied in extent. Despite these limitations, an algorithm that autonomously detects, segments, and assesses the changed/unchanged status of various abnormalities based on the degree of observed change would be valuable.

Therefore, this study aims to develop a deep learning-based classifier for determining changed/unchanged status in paired CRs, using automatic lesion segmentation and extent comparison for consolidation and pleural effusion, and to validate its feasibility.

Methods

This retrospective study was approved by the institutional review board of Asan Medical Center, which waived the requirement for written informed consent (approval number: 2023-0810, date: 2023-07-01). Of the 5,178 CRs used for training, 4,593 were utilized in a previous study to develop a model for detecting five abnormalities.13 However, our model is not related to the model from that study.

Training and validation datasets

In the classifier pipeline, the training set for abnormality segmentation was derived from CRs of adult patients (≥18 years) obtained at a tertiary referral hospital between January 2015 and December 2018 (Figure 1). The training set consisted of three types: normal CR, abnormal CR (with consolidation or pleural effusion), and CR with medical devices (Appendix S1). Radiologist-labeled lesion masks that had been developed and validated in the previous work were used.13 However, the lesion segmentation algorithm, the paired radiograph comparison, and the change-detection framework were newly developed in this study. During the training process for the segmentation component of the model, the training set was further divided into a 9:1 ratio for model development and tuning.

After developing a lesion segmentation algorithm, CRs obtained from the emergency department (ED) and ICU between January 2019 and December 2019 were collected to determine the changed/unchanged classifier threshold. For each patient, one pair of CRs was randomly selected while maintaining the chronological order. The pairing principle was applied regardless of the CR projection type (posteroanterior or anteroposterior). However, due to the nature of the ED and ICU settings with patients who are critically ill, most radiographs were anteroposterior. Two thoracic radiologists (BLINDED and BLINDED, with 7 and 17 years of experience in thoracic imaging, respectively), blinded to the radiologic report, interpreted the changed/unchanged status, as well as the presence of target abnormality (i.e., consolidation and pleural effusion), in queried CR pairs in a random order until the target number of each dataset was reached. Both the changed/unchanged status and type of abnormality were determined in consensus by the two radiologists.

For temporal validation of the changed/unchanged classifier, CRs obtained from the ED and ICU between January 2020 and December 2020 were collected, each containing a single abnormality (consolidation or pleural effusion). To compare the performance between the model and radiologist, another thoracic radiologist (BLINDED, with 27 years of experience in thoracic imaging) independently reviewed the temporal validation set and determined the changed/unchanged status. This review was conducted blinded to the reference standard result but with knowledge of the target abnormality type (consolidation vs. pleural effusion).

Architecture of the lesion-specific classifier

Our model included two pipelines: 1) abnormality segmentation and 2) lesion area quantification and decision-making within pairs (Figure 2). First, the nnU-Net, a U-Net-based medical segmentation model known for its robust and high performance, served as the base model. Its structure and training options were modified for enhanced pulmonary lesion segmentation performance.14 To improve the model’s generalization ability, a multi-task learning (MTL) approach that jointly performs segmentation and classification was adopted, thereby improving the model’s capability to differentiate between lesions in similar anatomical locations and medical devices and reducing potential segmentation errors. Two auxiliary classifiers were incorporated at the nnU-Net bottleneck for MTL: one for lesion presence classification and the other for lesion type classification (Appendix S2). The modified nnU-Net was trained for 1,000 epochs using 5-fold cross-validation, and the final lesion segmentation masks were generated by ensembling the inferred masks from each fold.
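The auxiliary-classifier idea can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released implementation: the module and function names (`AuxiliaryHeads`, `multitask_loss`), the pooling choice, and the loss weights are all assumptions; the study attaches such heads to the nnU-Net bottleneck.

```python
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    """Two classification heads fed by bottleneck features:
    one for lesion presence (binary), one for lesion type (multi-class)."""
    def __init__(self, bottleneck_channels: int, num_lesion_types: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.presence_head = nn.Linear(bottleneck_channels, 1)
        self.type_head = nn.Linear(bottleneck_channels, num_lesion_types)

    def forward(self, bottleneck: torch.Tensor):
        feat = self.pool(bottleneck).flatten(1)             # (N, C, H, W) -> (N, C)
        return self.presence_head(feat), self.type_head(feat)

def multitask_loss(seg_loss, presence_logit, presence_target,
                   type_logit, type_target, w_presence=0.1, w_type=0.1):
    """Joint objective: segmentation loss plus weighted auxiliary losses.
    The 0.1 weights are illustrative, not values reported in the study."""
    bce = nn.functional.binary_cross_entropy_with_logits(presence_logit, presence_target)
    ce = nn.functional.cross_entropy(type_logit, type_target)
    return seg_loss + w_presence * bce + w_type * ce
```

During training, the segmentation decoder and both heads are optimized jointly, which is what pushes the encoder to separate lesions from similarly located medical devices.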

In the changed/unchanged classifier, lesion areas in each generated mask were quantified by multiplying the number of pixels in each lesion class by the pixel spacing of the corresponding CR. Changes in lesion quantities were calculated as the absolute difference in quantified lesion areas divided by the larger of the two values to determine the relative change. The tuning set was used to optimize the threshold for changed/unchanged decision-making (Appendix S3). The calculated ratio was then used to classify each paired image as changed/unchanged based on a predefined threshold (Supplementary Figure 1).
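The quantification-and-decision step described above reduces to a few lines of arithmetic. The sketch below is a hypothetical rendering (function names and the handling of lesion-free pairs are our assumptions); the example threshold of 0.26 is the ED consolidation threshold derived from the tuning set.

```python
import numpy as np

def lesion_area_mm2(mask: np.ndarray, pixel_spacing: tuple) -> float:
    """Lesion area in mm^2: pixel count in the binary mask times pixel area."""
    return float(mask.sum()) * pixel_spacing[0] * pixel_spacing[1]

def relative_change(area_prev: float, area_curr: float) -> float:
    """Absolute area difference divided by the larger of the two areas."""
    denom = max(area_prev, area_curr)
    if denom == 0:                          # lesion absent in both images (our assumption)
        return 0.0
    return abs(area_curr - area_prev) / denom

def classify_pair(area_prev: float, area_curr: float, threshold: float) -> str:
    """Label a CR pair by comparing the relative change against a preset threshold."""
    return "changed" if relative_change(area_prev, area_curr) > threshold else "unchanged"
```

For example, a consolidation growing from 1,000 mm² to 1,500 mm² yields a relative change of 1/3, which exceeds the 0.26 threshold and is labeled "changed."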

Model training was performed using the PyTorch framework on an NVIDIA TITAN RTX 24 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA). The code for the model architecture is available on GitHub (https://github.com/provbs/CR_DL_FU/).

Statistical analysis

The segmentation performance of the model for consolidation and pleural effusion was evaluated using Dice scores. T-tests were conducted to compare models, and P values were calculated for their differences. The performance of the model in classifying changed/unchanged status was evaluated using the radiologists' consensus as the reference standard. The AUC was calculated, and the optimal threshold was determined using the Youden index based on the tuning set results. The accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score on the temporal validation set were then calculated using the predetermined threshold. The accuracies of the model and the radiologist were compared using the McNemar test. All statistical analyses were conducted using R version 4.3.1 (R Foundation for Statistical Computing).
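The Youden-index threshold selection can be illustrated in a few lines. The study performed this in R; the standalone Python function below is a simplified sketch of the same idea, not the authors' code.

```python
import numpy as np

def youden_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Pick the score threshold maximizing Youden's J = sensitivity + specificity - 1.
    `labels` are 1 for "changed" pairs and 0 for "unchanged"; `scores` are the
    relative lesion-area changes, where higher values should indicate "changed"."""
    best_j, best_t = -1.0, 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        sens = tp / max(np.sum(labels == 1), 1)
        spec = tn / max(np.sum(labels == 0), 1)
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t
```

Each candidate threshold is one of the observed scores, so the search is exhaustive over the ROC operating points of the tuning set.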

Results

Dataset characteristics

The training dataset consisted of 1,700 normal CRs, 1,223 with pleural effusion, 1,420 with consolidation, and 585 with medical devices. For the changed/unchanged classifier tuning and temporal validation, 3,699 CR pairs from the ICU and 31,846 from the ED were generated after excluding single CRs without follow-up.

In the ED dataset, 30 pleural effusion pairs (20 changed, 10 unchanged) and 30 consolidation pairs (20 changed, 10 unchanged) from 2019 were included for the changed/unchanged classifier tuning. For temporal validation, 40 pleural effusion pairs (20 changed, 20 unchanged) and 40 consolidation pairs (20 changed, 20 unchanged) from 2020 were selected. In the ICU dataset, 40 pleural effusion pairs (20 changed, 20 unchanged) and 40 consolidation pairs (20 changed, 20 unchanged) from 2019 were used for the changed/unchanged classifier tuning, whereas the same numbers of pleural effusion and consolidation pairs from 2020 were used for temporal validation.

The median interval between CR pairs was 10 days (interquartile range: 1–100 days) in the tuning set and 15 days (interquartile range: 1–72 days) in the temporal validation set. Table 1 shows the demographics in detail.

Performance of lesion segmentation

The nnU-Net with MTL (using two auxiliary classifiers) and medical equipment masks in the training dataset was the best-performing segmentation model, with Dice scores of 0.848 for pleural effusion and 0.841 for consolidation (Table 2 and Supplementary Figure 2).

Incorporating medical equipment labels during training enhanced the Dice score for consolidation by approximately 0.044, although it decreased that for pleural effusion by 0.043, resulting in no major change in the average Dice score. Nevertheless, the qualitative results showed that the model trained with medical equipment labeling considerably reduced misclassification of medical equipment as lesions, a critical distinction in ICU and ED settings. Furthermore, integrating MTL and medical equipment labels improved the average Dice score by 0.015, reducing the difference between lesion types and achieving a more balanced performance (Figure 3).

Performance of lesion-specific change detection

In the tuning set, the AUCs of the model were 0.747 for consolidation and 0.850 for pleural effusion in the ED, and 0.980 for consolidation and 0.800 for pleural effusion in the ICU (Supplementary Figure 3). To account for different clinical settings, thresholds were determined separately for the ED and ICU. The optimal thresholds derived from the tuning set were 0.26 for consolidation and 0.29 for pleural effusion in the ED and 0.40 for consolidation and 0.55 for pleural effusion in the ICU.

In the temporal validation set, the AUCs of the model were 0.988 for consolidation and 0.883 for pleural effusion in the ED and 0.970 for consolidation and 0.955 for pleural effusion in the ICU (Figure 4). The AUC for consolidation was similar between the ED and ICU, whereas the AUC for pleural effusion in the ED was slightly lower than that in the ICU.

Comparisons between the model and the thoracic radiologist

In the ED, the model achieved an accuracy of 0.900 (36/40) for consolidation, with a sensitivity of 1.000 for “changed” and a specificity of 0.800 for “unchanged.” For pleural effusion, the accuracy was 0.825 (33/40), with a sensitivity of 0.850 and specificity of 0.800 (Figure 5). The accuracy of the thoracic radiologist was 0.975 (39/40) for consolidation and 0.950 (38/40) for pleural effusion (Table 3).

In the ICU, the model achieved an accuracy of 0.875 (35/40) for consolidation, with a sensitivity of 0.900 for “changed” and a specificity of 0.850 for “unchanged.” For pleural effusion, the accuracy of the model was 0.800 (32/40), with a sensitivity of 0.600 and specificity of 1.000 (Supplementary Figures 4 and 5). The accuracy of the thoracic radiologist was 0.975 (39/40) for consolidation and 1.000 (40/40) for pleural effusion (Table 3 and Supplementary Figure 5).

When comparing the accuracy of the model and the thoracic radiologist, no significant difference was found for consolidation in the ED [0.900 (36/40) vs. 0.975 (39/40), P = 0.371], pleural effusion in the ED [0.825 (33/40) vs. 0.950 (38/40), P = 0.182], and consolidation in the ICU [0.875 (35/40) vs. 0.975 (39/40), P = 0.221]. However, for pleural effusion in the ICU, the radiologist outperformed the model [1.000 (40/40) vs. 0.800 (32/40), P = 0.013] (Supplementary Figure 6).
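The McNemar comparison above depends only on the discordant pairs (cases where exactly one reader was correct). The sketch below is an exact-binomial variant for illustration; the study's R analysis may instead use the continuity-corrected chi-square form, so exact p-values can differ slightly from those reported.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs.
    b = pairs only the model classified correctly,
    c = pairs only the radiologist classified correctly.
    Returns the two-sided binomial p-value under H0: P(b) = P(c) = 0.5."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Double the smaller binomial tail; cap at 1 when the tails overlap.
    p = sum(comb(n, i) for i in range(0, k + 1)) * 2 / 2 ** n
    return min(1.0, p)
```

With eight discordant pairs all favoring one reader, for instance, the exact two-sided p-value is 2/256 ≈ 0.008, in the same range as the continuity-corrected result of 0.013 cited for ICU pleural effusion.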

Discussion

Multiple CRs for follow-up are common in clinical practice. However, current interpretation techniques are largely limited to single images, and automated methods for follow-up CR analysis remain underdeveloped. In this study, we developed and validated a deep-learning model for assessing the changed/unchanged status through lesion-specific segmentation. In validation within the ED and ICU settings, the model classified the changed/unchanged status with an accuracy of 0.875–0.900 for consolidation and 0.800–0.825 for pleural effusion, comparable to that of the radiologist, except for pleural effusion in the ICU.

Interpreting follow-up CRs poses challenges for both radiologists and deep-learning algorithms due to changes in the thoracic cage caused by variations in posture or inspiration status, as well as background changes such as alterations in medical devices. Consequently, few studies focus on the automated interpretation of CR pairs.7, 8, 12 The approach of determining changes or no changes in the overall image landscape can help prioritize worklists and improve workflow efficiency,7, 8 although it lacks details on the specific objects involved or the extent of the changes. Unlike previous approaches, we aimed to develop a model that identifies specific abnormal changes. Lesion-specific interpretation is straightforward and enables the detection of clinically relevant changes, such as consolidation increases in patients with pneumonia. With further refinement, it could become a component of an autonomous reporting system. To achieve this, we focused on two major abnormalities—consolidation and pleural effusion—that are commonly monitored for treatment response. These abnormalities were tested in the ED and ICU settings, where they are more prevalent and dynamic than in outpatient clinics or general wards.

Our model achieved an AUC of 0.883–0.988, outperforming a previous study (AUC: 0.687 for pulmonary opacity changes and 0.782 for pleural effusion changes) that determined changed/unchanged status solely based on lesion persistence.12 It was also comparable with prior non-lesion-specific models, which reported AUCs of 0.800–0.858.7, 8 This performance may be attributable to the accurate lesion segmentation of our model, which achieved a Dice score of up to 0.848 in the training set, and to its reduced misclassification of medical devices as lesions. Singh et al.12 reported that the mis-segmentation of medical devices as pulmonary opacities is a challenge. To address this, we specifically trained our model on CRs with medical devices, ensuring robust performance in the ICU and ED settings, where such devices are almost always present.

The accuracy of our model was similar to that of the radiologist for consolidation in the ICU and ED and for pleural effusion in the ED, though slightly lower. The decision of the radiologist on whether a condition had changed or remained unchanged closely aligns with the reference standard. Although consolidation and pleural effusion are typically assessed qualitatively in routine practice, the threshold of readers may be interchangeable. Our model showed considerably lower performance than that of the radiologist for pleural effusion in the ICU. This may be related to the position of the patient in the ICU. In patients in the supine position, both consolidation and pleural effusion can appear as diffusely increased opacity, making differentiation difficult. Pleural fluid tends to spread under gravity, making the margins of effusion indistinct. Radiologists also assess changes in pleural effusion while considering positional changes, which may be challenging for our model. Notably, all incorrect ICU pleural effusion classifications occurred in “changed” cases, whereas the model correctly identified all stable cases. We therefore consider that the model can adequately triage stable pleural effusion, but reduced sensitivity in supine patients remains an important limitation that warrants further refinement. In addition, the limited size of the tuning set (40 pairs) may have led to overfitting of the threshold for pleural effusion, contributing to skewed results. Although this should be addressed in future studies, our findings provide proof of concept for the feasibility of lesion-specific segmentation in change status detection.

Unlike preexisting non-lesion-specific models, which primarily filter grossly stable CR pairs, our lesion-specific model offers dual functionality. It can inform and prioritize changes for physician review while simultaneously filtering stable cases. Previous approaches based on registration and subtraction within pairs are limited compared with a segmentation-based method, which has the potential to be applied to multiple CRs in longitudinal follow-up, enabling the extraction of lesion extent as time-series data with quantification. Recent advances in language models have enabled training on large-scale, weakly labeled data for multi-label, multi-class change detection and even automated report generation.15, 16 In contrast, our model leverages radiologist-provided hard labeling, which ensures disease-specific accuracy and offers interpretable, intuitive visual explanations of the degree of change. These strengths may provide potential applicability, working synergistically with text generation models as part of automated reporting systems. However, our model is currently limited to two abnormalities: consolidation and pleural effusion. Expanding its capabilities to include other major abnormalities, such as nodules or interstitial opacities, may be a valuable next step. Furthermore, improving lesion segmentation and consideration of position change are warranted.

Our study has some limitations. First, as a single-center retrospective study, it may have selection bias and limited generalizability. Second, the experiment was conducted using datasets from the same institution. Although the datasets do not overlap, true external validation was not performed. Third, the tuning and temporal validation sets were relatively small. Since our model was designed specifically for consolidation and pleural effusion, only patients with at least one of these abnormalities were eligible. This may contribute to the performance differences between the tuning and temporal validation sets. Further validation in a larger population is necessary.

In conclusion, lesion-specific segmentation enables the deep-learning-based model to determine the changed/unchanged status of consolidation and pleural effusion based on changes in their extent.

Conflict of interest disclosure

The authors declared no conflicts of interest.

Funding

This work was supported by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 2710002589, RS-2023-00254209).

References

1
Expert Panel on Thoracic Imaging; Morris MF, Henry TS, Raptis CA, et al. ACR appropriateness criteria® workup of pleural effusion or pleural disease. J Am Coll Radiol. 2024;21(6S):S343-S352.
2
Krishna R, Antoine MH, Alahmadi MH, Rudrappa M. Pleural Effusion. 2024 Aug 31. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2025 Jan–.
3
Karkhanis VS, Joshi JM. Pleural effusion: diagnosis, treatment, and management. Open Access Emerg Med. 2012;4:31-52.
4
Little BP, Gilman MD, Humphrey KL, Alkasab TK, Gibbons FK, Shepard JA, et al. Outcome of recommendations for radiographic follow-up of pneumonia on outpatient chest radiography. AJR Am J Roentgenol. 2014;202(1):54-9.
5
Gershengorn HB, Wunsch H, Scales DC, Rubenfeld GD. Trends in use of daily chest radiographs among us adults receiving mechanical ventilation. JAMA Netw Open. 2018;1(4):e181119.
6
Oba Y, Zaza T. Abandoning daily routine chest radiography in the intensive care unit: meta-analysis. Radiology. 2010;255(2):386-95.
7
Cho K, Kim J, Kim KD, et al. Music-ViT: a multi-task Siamese convolutional vision transformer for differentiating change from no-change in follow-up chest radiographs. Med Image Anal. 2023;89:102894.
8
Yun J, Ahn Y, Cho K, et al. Deep learning for automated triaging of stable chest radiographs in a follow-up setting. Radiology. 2023;309(1):e230606.
9
Li MD, Arun NT, Gidwani M, et al. Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional siamese neural networks. Radiol Artif Intell. 2020;2(4):e200079.
10
Huang T, Yang R, Shen L, et al. Deep transfer learning to quantify pleural effusion severity in chest X-rays. BMC Med Imaging. 2022;22(1):100.
11
Lim CY, Cha YK, Chung MJ, et al. Estimating the volume of nodules and masses on serial chest radiography using a deep-learning-based automatic detection algorithm: a preliminary study. Diagnostics (Basel). 2023;13(12):2060.
12
Singh R, Kalra MK, Nitiwarangkul C, et al. Deep learning in chest radiography: detection of findings and presence of change. PLoS One. 2018;13(10):e0204155.
13
Park B, Cho Y, Lee G, et al. A curriculum learning strategy to enhance the accuracy of classification of various lesions in chest-PA X-ray screening for pulmonary abnormalities. Sci Rep. 2019;9(1):15352.
14
Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203-211.
15
Yu K, Ghosh S, Liu Z, Deible C, Poynton CB, Batmanghelich K. Anatomy-specific progression classification in chest radiographs via weakly supervised learning. Radiol Artif Intell. 2024;6(5):e230277.
16
Wang Z, Deng Q, So TY, Chiu WH, Lee K, Hui ES. Disease probability-enhanced follow-up chest X-ray radiology report summary generation. Sci Rep. 2025;15(1):26930.

Supplementary Materials