ABSTRACT
PURPOSE
To assess the performance and feasibility of generative deep learning in enhancing the image quality of T2-weighted (T2W) prostate magnetic resonance imaging (MRI).
METHODS
Axial T2W images from the prostate imaging: cancer artificial intelligence dataset (n = 1,476 biologically male patients; n = 1,500 scans) were used, partitioned into training (n = 1,300), validation (n = 100), and testing (n = 100) sets. A Pix2Pix model was trained to enhance image quality on pairs of original and synthetically degraded images, the latter generated using operations such as motion, Gaussian noise, blur, ghosting, spikes, and bias field inhomogeneities. The efficacy of the model was evaluated by seven radiologists, who used the prostate imaging quality criteria to assess the original, degraded, and improved images. The evaluation also included tests to determine whether the images were original or synthetically improved. Additionally, the model’s performance was tested on an in-house external testing dataset of 33 patients. Statistical significance was assessed using the Wilcoxon signed-rank test.
RESULTS
Results showed that synthetically improved images [median score (interquartile range): 4.71 (1)] were of higher quality than degraded images [3.36 (3), P = 0.0001], with no significant difference from original images [5 (1.14), P > 0.05]. Observers identified original and synthetically improved images as original at similar rates (52% and 53%, respectively), demonstrating the model’s ability to retain realistic attributes. External testing on a dataset of 33 patients confirmed a significant improvement (P = 0.001) in image quality, from a median score of 4 (2.286) to 4.71 (1.715).
CONCLUSION
The Pix2Pix model, trained on synthetically degraded data, effectively improved prostate MRI image quality while maintaining realism and demonstrating both applicability to real data and generalizability across various datasets.
CLINICAL SIGNIFICANCE
This study critically assesses the efficacy of the Pix2Pix generative-adversarial network in enhancing T2W prostate MRI quality, demonstrating its potential to produce high-quality, realistic images that are indistinguishable from originals and thereby to advance radiology practice by improving diagnostic accuracy and image reliability.
Main points
• The Pix2Pix generative-adversarial network (GAN) significantly improved T2-weighted prostate magnetic resonance imaging (MRI) quality while maintaining realism.
• Synthetically improved images scored higher than degraded ones and were comparable with the original images.
• External testing confirmed significant image quality improvement.
• Radiologists could not distinguish between original and synthetically improved images.
• The study demonstrated GANs’ potential for realistic, high-quality prostate MRI enhancement.
The prostate imaging reporting and data system (PI-RADS) and its updates prescribe best practices for the acquisition and interpretation of prostate magnetic resonance imaging (MRI) scans,1 emphasizing minimum technical requirements to ensure scan quality, which is crucial for the accurate detection of clinically significant prostate cancer. However, adherence to PI-RADS guidelines does not invariably guarantee high-quality MRI scans, as evidenced by various studies.2-4
Deep learning (DL)-based reconstruction techniques can speed up image acquisition and improve quality beyond traditional MRI methods.5-8 Nonetheless, variables such as patient characteristics, equipment quality, and the expertise of the radiology team can still result in suboptimal images.3, 9 Moreover, DL-based reconstruction typically requires newer scanner models and significant initial investments, limiting its accessibility. Despite this need, limited research has been conducted on applying DL techniques to enhance prostate MRI quality. Existing studies are often constrained by the use of single-center datasets and proprietary scoring systems, which can affect the reproducibility of their outcomes.10
In this study, we employed a generative-adversarial network (GAN) model, Pix2Pix, to enhance the quality of axial T2-weighted (T2W) prostate MRI. We used a large-scale, multi-center, publicly available dataset, prostate imaging: cancer artificial intelligence (PI-CAI),11 allowing us to overcome some of the limitations noted in previous studies. Image quality was evaluated by multiple readers from different centers using a scoring system adopted from the newly introduced prostate imaging quality (PI-QUAL) criteria,12 which provided a standardized assessment method. We also examined the realism of the generated images and tested the model’s performance on an in-house external testing dataset to evaluate its generalizability.
Methods
Study sample
The Acıbadem University and Healthcare Institution’s Medical Research Ethics Committee approved this retrospective study and waived the requirement for informed consent for the retrospective collection, analysis, and presentation of anonymized medical data (date: 11.02.2021, decision no: 2021-03/12).
This study utilized the publicly available PI-CAI training dataset, which consisted of 1,500 bi-parametric prostate MRI scans obtained from 1,476 biologically male individuals at four tertiary academic centers in the Netherlands and Norway between March 2015 and January 2018. The data from these four centers were stratified across the training, validation, and internal testing sets to ensure representation from each center in all data partitions. The examinations were partitioned into a development set (1,400 scans: 1,300 for training and 100 for validation) and a testing set (100 scans). This partitioning was done with careful consideration to ensure that scans from the same patient were not included across the development and testing sets. The flowchart of the study is given in Figure 1.
We also included an in-house dataset of 33 bi-parametric MRI examinations from 33 biologically male individuals as an external testing dataset in this study. The overall workflow of the study is shown in Figure 2.
Bi-parametric magnetic resonance imaging examinations
All bi-parametric MRI scans of the PI-CAI dataset were conducted using 1.5T units (n = 82) from Siemens (Aera and Avanto models, Siemens Healthcare, Erlangen, Germany) and Philips (Achieva and Intera models, Philips Healthcare, Eindhoven, the Netherlands), as well as 3T units (n = 1,418) from Siemens (Skyra, TrioTim, and Prisma models, Siemens Healthcare, Erlangen, Germany) and Philips (Ingenia model, Philips Healthcare, Eindhoven, the Netherlands). These scans utilized surface coils and adhered to the PI-RADS V2 guidelines. Additional specifications regarding the MRI protocols used for the study sample are detailed elsewhere.11
The examinations of the in-house testing dataset were performed using 1.5T units from Siemens (Avanto-fit, Siemens Healthcare, Erlangen, Germany). These scans were also performed with surface coils. For this study, only axial T2W images were used for further analysis. Table 1 shows the imaging protocol for the in-house testing dataset.
Synthetic data creation
In this study, a crucial step involved the creation of a robust training dataset by applying clinically relevant MRI artifacts to generate realistic low-quality T2W images. For this purpose, the TorchIO library,13 a tool specifically designed for data augmentation in medical imaging, was used. A variety of techniques were employed to simulate commonly encountered artifacts, including motion, Gaussian noise, blur, ghosting, spikes, and bias field inhomogeneities (detailed in Supplementary S1).
All images were normalized and resized to uniform dimensions to facilitate consistent neural network training, ensuring each image had intensity values within a specific range for optimal input standardization.
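For illustration, a minimal sketch of such a degradation pipeline is given below; the transform parameters, target shape, and file path are illustrative assumptions, and the exact transforms and settings used in this study are described in Supplementary S1.

```python
import torchio as tio

# Hypothetical degradation pipeline; the parameter ranges shown here are for
# demonstration only (the study's settings are detailed in Supplementary S1).
degrade = tio.Compose([
    tio.RescaleIntensity(out_min_max=(0, 1)),      # normalize intensities to [0, 1]
    tio.Resize((256, 256, 24)),                    # assumed uniform target dimensions
    tio.OneOf({                                    # apply one artifact (or combination)
        tio.RandomMotion(degrees=10, translation=10): 0.2,
        tio.RandomGhosting(num_ghosts=(2, 5)): 0.2,
        tio.RandomSpike(num_spikes=(1, 3)): 0.2,
        tio.RandomBiasField(coefficients=0.5): 0.2,
        tio.Compose([tio.RandomBlur(std=(0.5, 1.5)),
                     tio.RandomNoise(std=(0.01, 0.05))]): 0.2,
    }),
])

# Usage on a single axial T2W volume (the path is a placeholder)
subject = tio.Subject(t2w=tio.ScalarImage("t2w_axial.nii.gz"))
degraded = degrade(subject)
```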
The training set included images manipulated with each artifact individually, as well as in specific combinations, enabling the Pix2Pix model to learn from a wide variety of possible artifact scenarios and improve its ability to generalize across different types of image quality corruption. The internal testing set was created using either a single artifact or a predefined combination of multiple artifacts. This setup allowed for a controlled evaluation of the model’s performance in enhancing images with known quality issues. Detailed descriptions of the data pre-processing are given in Supplementary Document S1.
Pix2Pix model
The Pix2Pix model, a conditional GAN, was utilized for image-to-image translation using paired images to improve accuracy. It consisted of a generator, employing a U-Net architecture to maintain anatomical features in medical images, and a discriminator, which used a PatchGAN classifier to focus on high-frequency details and realism by evaluating small patches within the images.14, 15
The training involved an adversarial process where the generator tried to create increasingly realistic images, whereas the discriminator improved at detecting synthetically improved images. The process was governed by a combined loss function: adversarial loss ensured the images were visually indistinguishable from the original ones, and L1 loss maintained structural integrity, reducing blurring and preserving crucial details. This setup enhanced the model’s ability to produce clinically useful MRI images while retaining essential diagnostic features.
The Pix2Pix model was trained for 200 epochs using the Adam optimizer with a learning rate of 0.0002, with the L1 loss emphasized early in training to enhance reconstruction accuracy. The training involved an adversarial setup in which the generator aimed to produce images close to the original ones by minimizing both the L1 and adversarial losses, whereas the discriminator sought to identify whether the image patches were original or synthetic, aiming to maximize the adversarial loss. A detailed description of the model is given in Supplementary Document S1.
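A simplified sketch of one training step with this combined objective is given below. The U-Net generator and PatchGAN discriminator are replaced by single-layer stand-ins so the snippet runs, and the L1 weight and Adam beta values are the common Pix2Pix defaults, assumed here for illustration; the exact configuration is given in Supplementary Document S1.

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs; replace with the U-Net generator and
# PatchGAN discriminator described in the text.
generator = nn.Conv2d(1, 1, kernel_size=3, padding=1)
discriminator = nn.Conv2d(2, 1, kernel_size=4, stride=2, padding=1)

adv_criterion = nn.BCEWithLogitsLoss()   # adversarial loss on patch logits
l1_criterion = nn.L1Loss()               # structural fidelity term
LAMBDA_L1 = 100                          # assumed weighting (common Pix2Pix default)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def training_step(degraded, original):
    # Discriminator update: distinguish real pairs from generated pairs.
    fake = generator(degraded)
    pred_real = discriminator(torch.cat([degraded, original], dim=1))
    pred_fake = discriminator(torch.cat([degraded, fake.detach()], dim=1))
    d_loss = 0.5 * (adv_criterion(pred_real, torch.ones_like(pred_real)) +
                    adv_criterion(pred_fake, torch.zeros_like(pred_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator while staying close to the original.
    pred_fake = discriminator(torch.cat([degraded, fake], dim=1))
    g_loss = (adv_criterion(pred_fake, torch.ones_like(pred_fake))
              + LAMBDA_L1 * l1_criterion(fake, original))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```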
The model’s performance during training was monitored using the mean absolute error (MAE) calculated between the reconstructed images and the original high-quality images. The model with the lowest MAE on the validation data was selected as the best-performing model for subsequent evaluation and application to the test set.
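Checkpoint selection by validation MAE could be sketched as follows; the validation loader and the per-epoch training call are assumed to be defined elsewhere.

```python
import torch

def validation_mae(generator, val_loader):
    """Mean absolute error between reconstructed and original validation images."""
    generator.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for degraded, original in val_loader:
            total += torch.mean(torch.abs(generator(degraded) - original)).item()
            batches += 1
    return total / batches

best_mae = float("inf")
for epoch in range(200):
    generator.train()
    # ... one epoch of adversarial training (see the sketch above) ...
    mae = validation_mae(generator, val_loader)
    if mae < best_mae:                      # keep the best-performing checkpoint
        best_mae = mae
        torch.save(generator.state_dict(), "best_generator.pt")
```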
The best-performing model was then applied to synthetically degraded internal testing data (n = 100) and original images from the in-house external testing dataset (n = 33) to assess performance, as detailed in the subsequent sections.
Study readers
Seven readers participated in the analysis of the scans for this study. Reader 1, an expert prostate radiologist, had interpreted over 300 cases annually for more than 10 years. Readers 2–7 were basic prostate readers, each handling 150–200 cases per year for 2–7 years. The classification of the readers adhered to the consensus statement of the European Society of Urogenital Radiology.16 Readers 1 and 3 were from the same center, whereas the others were based in various other hospitals, ranging from academic to non-academic settings.
Assessment criteria
The evaluation by the readers was adopted from the visual assessment criteria proposed in the PI-QUAL for T2W imaging.12 Specifically, the readers assessed the clarity with which they could delineate the capsule, seminal vesicles, ejaculatory ducts, neurovascular bundle, and sphincter muscle, awarding one point for each positively identified structure (i.e., if the structure could be seen clearly). Additionally, they awarded one point in the absence of artifacts and zero points if artifacts were present. Thus, the total score for each examination ranged from zero (the worst quality) to six points (the best quality).
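As a simple illustration of this scoring arithmetic (the variable names are hypothetical):

```python
def t2w_quality_score(capsule, seminal_vesicles, ejaculatory_ducts,
                      neurovascular_bundle, sphincter, artifact_free):
    """Score from 0 (worst) to 6 (best): one point per clearly delineated
    structure, plus one point if no artifacts are present."""
    structures = [capsule, seminal_vesicles, ejaculatory_ducts,
                  neurovascular_bundle, sphincter]
    return sum(bool(s) for s in structures) + int(bool(artifact_free))

# Example: all five structures clearly seen, but artifacts present -> score of 5
print(t2w_quality_score(True, True, True, True, True, artifact_free=False))
```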
Before the reading sessions, several online meetings were conducted to familiarize the readers with the PI-QUAL criteria through examples from published papers17 and to acquaint them with the reading platform. The primary aim of these sessions was to enhance their understanding of the PI-QUAL.
We used only axial T2W images for the reading sessions, as the model employed in the current study was designed to work with axial T2W images. Although this may be considered a limitation, it was consistent with earlier work, which primarily focused on axial images as they were the primary sequence used in PI-RADS assessments.
Case reading sessions
The readers used a dedicated workstation equipped with a 6-megapixel diagnostic color monitor (Radiforce RX 660, EIZO) and a dedicated browser-based platform (https://matrix.md.ai). All reviewed images were in the Digital Imaging and Communications in Medicine format.
Initially, 7 readers evaluated 300 T2W series in the internal testing set of 100 patients, consisting of 100 original, 100 synthetically degraded, and 100 synthetically improved series. The readers independently assessed the cases in a random order to minimize bias, not knowing which images were original, degraded, or improved. They assigned points to each examination based on the previously described criteria and judged whether the images were original or synthesized.
Subsequently, to further evaluate the model’s performance and its ability to enhance image quality on real data, the readers assessed the scans in the in-house external testing dataset of 33 patients, which included 33 original and 33 synthetically improved T2W series.
Statistical analysis
Statistical analyses were performed using the SciPy library in Python version 3. Continuous variables were presented using medians and interquartile ranges, whereas categorical and ordinal variables were presented with frequencies and percentages.
The structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) were used as quantitative metrics to assess image quality. The SSIM evaluated perceptual similarity by comparing luminance, contrast, and structure between images, with a range from −1 to 1, where 1 indicated perfect similarity. The PSNR measured the ratio between the maximum possible signal value and the distortion introduced, expressed in decibels, where higher values indicated better quality.
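A minimal sketch of how these metrics could be computed, for example with scikit-image, is given below; the data-range handling and example arrays are assumptions made for illustration.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(reference, test):
    """Return PSNR (dB) and SSIM between a reference slice and a test slice."""
    reference = reference.astype(np.float64)
    test = test.astype(np.float64)
    data_range = reference.max() - reference.min()
    psnr = peak_signal_noise_ratio(reference, test, data_range=data_range)
    ssim = structural_similarity(reference, test, data_range=data_range)
    return psnr, ssim

# Usage on a pair of 2D slices (arrays are illustrative)
reference = np.random.rand(256, 256)
degraded = reference + 0.05 * np.random.randn(256, 256)
print(quality_metrics(reference, degraded))
```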
For comparing image quality assessments across original, synthetically degraded, and synthetically improved images, pairwise comparisons were conducted using the Friedman test and post-hoc Durbin-Conover test due to the matched nature of the data. For the pairwise comparison of the in-house external testing dataset, the Wilcoxon signed-rank test was used.
To evaluate the performance of radiologists in correctly identifying original versus synthetically improved images, accuracy was calculated. To analyze the differences in radiologists’ ability to detect synthetically improved versus original images, McNemar’s test was used. A P value less than 0.05 was considered statistically significant.
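The core comparisons can be sketched with SciPy as shown below; the score arrays are illustrative placeholders, and the post hoc Durbin-Conover comparisons and McNemar’s test, which are provided by other packages, are only noted in comments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Per-case reader scores (placeholders standing in for the study data)
original = rng.uniform(3, 6, size=100)
degraded = rng.uniform(1, 5, size=100)
improved = rng.uniform(3, 6, size=100)

# Matched three-way comparison across image versions (internal testing set)
friedman_stat, friedman_p = stats.friedmanchisquare(original, degraded, improved)

# Pairwise comparison of matched scores (in-house external testing set)
ext_original = rng.uniform(2, 6, size=33)
ext_improved = ext_original + rng.uniform(0.1, 1.0, size=33)
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(ext_original, ext_improved)

# Post hoc Durbin-Conover comparisons and McNemar's test are available in
# separate packages (e.g., scikit-posthocs and statsmodels) and are omitted here.
print(friedman_p, wilcoxon_p)
```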
Results
Image quality assessment
We included 100 examinations from the PI-CAI testing set, each paired with its synthetically degraded and improved versions. The PSNR and SSIM values of the synthetically improved images [PSNR: 28.79 (32.54), SSIM: 0.92 (0.16)] were statistically significantly higher than those of the degraded forms [PSNR: 24.87 (15.27), SSIM: 0.78 (0.13)] (PSNR: P < 0.001, SSIM: P < 0.001).
During the random blinded assessment, the observers gave median scores of 5 (1.14) to the original images, 3.36 (3) to the synthetically degraded images, and 4.71 (1) to the synthetically improved images (P = 0.0001). Pair-wise comparisons revealed that original images had a significantly higher median quality score than the synthetically degraded images (P < 0.0001). Likewise, synthetically improved images also had a higher image quality than synthetically degraded images (P < 0.0001). No statistically significant difference was found between the median image quality of the original and synthetically improved images (P = 0.37) (Figure 3a). A detailed breakdown of each reader’s median scores for original, synthetically degraded, and synthetically improved images is given in Supplementary Document S2.
Figure 4 shows a representative example of original, synthetically degraded, and synthetically improved images. More representative examples can be found in Supplementary Document S2.
Original vs. synthetic assessment
We evaluated whether the observers could discriminate between original and synthetically improved T2W images from the PI-CAI testing set using a majority voting scheme. In this test, the observers identified 52% of the original and 53% of the synthetically improved images as original, with no statistical difference (P = 0.62), indicating that the observers could not reliably discriminate between original and synthetically improved images. A detailed breakdown of each reader’s assessments of whether the images were original or synthetic is provided in Supplementary Document S2.
External testing
We evaluated whether the proposed model could also improve original images from the in-house external testing dataset. This set consisted of T2W images of 33 patients from the in-house center, where prostate images were obtained using a 1.5T scanner. The observers gave a median score of 4 (2.286) for the original images in the in-house external testing dataset. The median image quality score for this dataset was statistically lower than that for the original T2W images from the PI-CAI testing set (P = 0.009).
The proposed model improved the image quality of the original images from 4 (2.2) to 4.71 (1.7), demonstrating a statistically significant improvement (P = 0.001) (Figure 3b). Notably, after the improvement, we found no statistical difference in median image quality between the original images from the PI-CAI dataset [median: 5 (1.14)] and the synthetically improved images from the in-house dataset [median: 4.71 (1.7)] (P = 0.16). A detailed breakdown of each reader’s median scores for original and synthetically improved images for the in-house external testing dataset is given in Supplementary Document S2.
Figure 5 shows representative examples of original and synthetically improved images of a patient along with observers’ ratings from the in-house external testing dataset. More representative examples can be found in Supplementary Document S2.
Discussion
We found that the Pix2Pix model significantly improved the quality of synthetically degraded images, as evidenced by quantitative metrics and the assessments of multiple readers with different experience levels from different institutions, following the criteria adopted from the PI-QUAL. Notably, the synthetically improved images showed no statistical difference in image quality compared with the original images.
We further tested the performance of the proposed model on an external testing dataset, where it substantially increased the image quality. This demonstrates that the model not only works across different datasets but is also effective in improving image quality for original images that have not been synthetically manipulated. This finding is promising as it suggests that DL models can be trained on available datasets without the need for actual poor-quality prostate MRIs. It is important to note that the PI-CAI dataset is derived from centers in the Netherlands and Norway. This geographical restriction could limit the generalizability of our findings to other populations. Future studies should include data from more diverse geographical regions.
Another important finding was that the readers, regardless of their experience levels, were unable to reliably discriminate original from synthetically improved images, showing that the proposed model not only improved image quality but also generated realistic-looking images without introducing over-smoothing or a plastic appearance.
Our findings diverge from those of the study by Belue et al.10, in which the authors observed no qualitative improvement and their readers, expert radiologists, mostly preferred original images over synthetically improved ones. Belue et al.10 utilized a Cycle-GAN model and tested it using paired original images of both poor and good quality from the same patients. Moreover, they employed bespoke qualitative criteria, which they acknowledged as a significant limitation of their study.10 We propose that by systematically incorporating a variety of artifacts, our model may better learn the representations of both poor- and good-quality images, thereby effectively transforming poor-quality images into good-quality ones in a realistic manner.
The tendency of DL methods toward over-smoothing diagnostic images has also been documented in studies using DL-based reconstruction methods.18 This smoothness can cause radiologists to feel uncertain about their interpretations, fearing potential loss of diagnostic information, such as the disruption of lesion appearance or visibility.18, 19 In contrast, our Pix2Pix model, trained on a meticulously prepared dataset, successfully generated realistic images, addressing these concerns by maintaining critical image details essential for accurate diagnosis. The training data included various levels of corruption for each augmentation, as well as combinations of these augmentations, together with the corresponding good-quality data. Including a combination of ghosting, spike artifacts, and bias field inhomogeneities with general Gaussian blur and noise in the training regime of the Pix2Pix model increased the robustness of our model against over-smoothing. However, our study did not explicitly evaluate the impact of image enhancement on lesion detection or characterization, which remains an essential area for future investigation.
In reflecting on the methods and results of our study, particularly in terms of experts identifying whether the images were original or synthetic, it is crucial to acknowledge the potential impact of bias. To minimize bias, we did not show the readers both the original and synthetic images simultaneously. Instead, the images were presented in a random order, and the readers were asked to determine their authenticity. A potential limitation is that readers 1 and 3 were from the same institution. Although this could introduce bias, the inclusion of readers from other centers helped mitigate this potential issue. Future work could incorporate strategies such as stratified sampling based on institutional affiliation to further address this. Intriguingly, the results suggested that the readers were essentially guessing, indicating no clear distinction between the original and synthetically improved images. However, this design may have inadvertently introduced another form of bias.
Knowing the study’s objective—to assess the realism of synthetically generated images—likely predisposed the readers to scrutinize each image more critically. This awareness could have heightened sensitivity to any minor imperfections, predisposing the readers to identify these as indicators of synthetic origin. Admittedly, it is virtually impossible to completely isolate this information from the readers since the core of our evaluation involved discerning the nature of the images, thus directly revealing the study’s design.
We openly acknowledge that the design of our study might have influenced the readers’ judgments. Recognizing this does not diminish the validity of our findings but rather enhances the transparency and integrity of our analysis. This situation underscores the need for further research to quantify and adjust for such bias, ensuring that the conclusions drawn are robust and applicable in real-world diagnostic settings. This will help in developing methodologies that better emulate the blind assessments typically conducted in clinical practice.
Several other limitations to our study warrant acknowledgment. First, our model was limited to axial T2W images and excluded other crucial sequences. Future studies could explore enhancing image quality across all sequences and integrating them into a single DL pipeline for more effective improvements.20 Although our study employed PI-QUAL V1, we acknowledge that V2.0 was released during our study period. Future studies should utilize the updated version for assessment.
Second, we used Pix2Pix due to its established use and relatively lower computational demands compared with state-of-the-art denoising diffusion probabilistic models, which require significantly more resources. Future work will include applying advanced architectures, including transformers and diffusion models, for image enhancement.
In conclusion, we demonstrated that a GAN model, Pix2Pix, trained on synthetically degraded axial T2W prostate MRI, can substantially improve image quality as evidenced by quantitative metrics and assessments from multiple readers with varying levels of experience following PI-QUAL criteria, showing no statistical difference in image quality compared with the original images. Additionally, the readers were unable to distinguish between original and synthetic images, indicating that the model did not introduce any unnatural appearance. Furthermore, the same model was able to improve image quality in an external testing dataset of original images, demonstrating its generalizability across datasets and its capability to improve both original and synthetically degraded images.