Deep learning in endoscopy: the importance of standardisation
Dear Editor,
In recent years, there has been an explosion of interest in the use of deep learning (DL) for medical image analysis. Different DL architectures have been proposed to address a variety of tasks, including image classification, object detection, segmentation and characterisation [1]. These advancements have led to significant progress in assisting clinicians in the evaluation of endoscopic frames, giving birth to a field of artificial intelligence-associated applications called “videomics” [2]. However, a major challenge is that the performance of these methods is often not directly comparable across institutions and series because of the lack of a standardised evaluation methodology.
The importance of standardising outcomes in DL for medical imaging cannot be overemphasised; the lack of standardisation is a major barrier to the translation of videomics algorithms into actual clinical practice. To enable fair comparisons between algorithms, it is essential to adopt common evaluation metrics that are agreed upon by the research community from both clinical and technical perspectives [3].
The dataset used for algorithm training, validation and testing should be representative of the real-world clinical scenario, possibly including data from different centres and annotated by different clinicians.
Herein, we propose a general guideline to standardise reporting in studies focused on the automatic analysis of endoscopic images (i.e. videomics). Figure 1 shows a schematic framework aimed at addressing study definition and outcome metrics.
Study definition
The first step is to clearly describe the setting and methodology:
- Objective: clearly state the type of task to be assessed.
- Algorithm: describe the DL algorithm, its architecture, and technical features (e.g. loss function, optimisation, learning rate, batch size, validation metric, strategy to stop training, training curves).
- Dataset: describe the number of patients and the number of frames extracted from each (with selection criteria). Technical information regarding the endoscopic and recording equipment, as well as the use of optical filters (e.g. narrow band imaging), should also be reported.
- Data split: describe the training-validation-test split ratio, specifying the type of cross-validation. Splits should be performed at the patient level, so that all frames from a given patient fall in a single set and data leakage between sets is avoided.
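A patient-level split can be enforced by partitioning patient identifiers first and only then collecting their frames. The following is a minimal plain-Python sketch (the mapping of patient IDs to frame lists and the 70/15/15 ratio are illustrative assumptions, not part of the original letter):

```python
import random

def patient_level_split(frames_by_patient, train=0.7, val=0.15, seed=42):
    """Assign every frame of a given patient to exactly one set,
    so no patient's data leaks between training, validation and test."""
    patients = sorted(frames_by_patient)
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * train)
    n_val = int(len(patients) * val)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group into its list of frames.
    return {s: [f for p in ps for f in frames_by_patient[p]]
            for s, ps in groups.items()}
```

Splitting at the frame level instead would place near-identical frames from the same patient in both training and test sets, inflating apparent performance.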
Definition of primary outcomes
A further step is to clearly define the study outcomes by selecting measures that are truly representative of diagnostic performance. Figure 1 shows the suggested metrics for each task, selecting those with a lower risk of producing a spuriously high score. Additional metrics can be reported, but should not be considered primary outcomes unless there is a clear rationale.
Regarding classification, given the clear dichotomous distinction between positive and negative results, the same rationale used for standard diagnostic tests should be applied. A wide range of metrics (Fig. 1) should be provided to allow a comprehensive assessment of the algorithm’s diagnostic characteristics. In addition, in case of class imbalance, appropriate metrics should be considered (e.g. F1-score instead of accuracy, precision-recall curve).
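The point about class imbalance can be made concrete with a short plain-Python sketch (the confusion-matrix counts below are invented for illustration): on an imbalanced test set, accuracy stays high while the F1-score reveals poor detection of the minority class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics derived from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Hypothetical imbalanced series: 95 negatives, 5 positives;
# the model finds only 2 of the 5 positive cases.
m = classification_metrics(tp=2, fp=3, tn=92, fn=3)
# accuracy = 0.94, yet F1 = 0.40
```

Reporting accuracy alone here would suggest a near-perfect classifier; the F1-score and the precision-recall curve expose the weakness on the positive class.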
Conversely, special considerations apply to detection and segmentation tasks. It is essential to report the percentage of frames in which a region of interest (ROI) has been detected. The most commonly used parameter is accuracy (Acc), which measures the percentage of correctly classified pixels. However, endoscopic frames are typically highly class-imbalanced: a frame usually contains a single ROI involving only a small fraction of pixels, while the remaining image is labelled as background. Because true negatives (i.e. background pixels) dominate, Acc will almost always yield high scores. Similarly, specificity (Spec) indicates the model’s capability to detect the background in an image. Given the large fraction of pixels annotated as background compared to the ROI, specificity values close to 1 are typical. For these reasons, Acc and Spec should not be used as primary outcomes.
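The inflation is easy to demonstrate with a toy example (a hypothetical 100-pixel frame encoded as a flat binary list; plain Python): a prediction that recovers only 1 of 5 ROI pixels still achieves 0.96 pixel accuracy and perfect specificity.

```python
def pixel_acc_spec(pred, gt):
    """Pixel accuracy and specificity for binary masks (lists of 0/1)."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
    acc = (tp + tn) / len(gt)
    spec = tn / (tn + fp)
    return acc, spec

# ROI covers 5 of 100 pixels; the prediction finds only the first one.
gt = [1] * 5 + [0] * 95
pred = [1] * 1 + [0] * 99
acc, spec = pixel_acc_spec(pred, gt)  # acc = 0.96, spec = 1.0
```

Both scores look excellent even though 80% of the ROI was missed, which is exactly why Acc and Spec are unsuitable as primary outcomes for segmentation.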
Sensitivity (also referred to as Recall) is another popular metric, but in detection and segmentation tasks it is less informative than overlap metrics such as the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). The DSC is calculated as twice the number of pixels shared by the predicted and ground truth segmentations, divided by the total number of pixels in the two sets; the IoU is calculated as the intersection of the predicted and ground truth segmentations divided by their union.
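Applied to the same toy masks as above (a hypothetical 100-pixel frame with a 5-pixel ROI of which only 1 pixel is predicted), a short sketch shows how the overlap metrics penalise the miss that pixel accuracy concealed:

```python
def dsc_iou(pred, gt):
    """Dice Similarity Coefficient and Intersection over Union
    for binary masks given as flat lists of 0/1."""
    inter = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    pred_area = sum(pred)
    gt_area = sum(gt)
    dsc = 2 * inter / (pred_area + gt_area)
    iou = inter / (pred_area + gt_area - inter)
    return dsc, iou

# Same masks as the accuracy example: 1 of 5 ROI pixels predicted.
gt = [1] * 5 + [0] * 95
pred = [1] * 1 + [0] * 99
dsc, iou = dsc_iou(pred, gt)  # dsc ≈ 0.33, iou = 0.2
```

Where pixel accuracy scored 0.96, the DSC and IoU correctly reflect that most of the ROI was missed.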
Finally, the Hausdorff Distance (HD) measures the distance between two sets of points (i.e. ground truth and predicted segmentation), and allows scoring localisation similarity by focusing on the delineation of margins. Since the HD is sensitive to outliers, the average HD may be better suited for most applications.
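The outlier sensitivity of the HD, and the robustness of its averaged form, can be sketched on two small boundary point sets (invented coordinates; the average HD has several variants in the literature, and the symmetric per-point mean used here is one common choice):

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets
    (e.g. boundary pixels of ground truth and prediction)."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

def average_hd(a, b):
    """Average symmetric Hausdorff distance: mean nearest-neighbour
    distance per point, so a single outlier is diluted."""
    def avg_directed(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return max(avg_directed(a, b), avg_directed(b, a))

# Contours identical except for one outlier point in the prediction.
gt = [(0, 0), (0, 1), (0, 2), (0, 3)]
pred = [(0, 0), (0, 1), (0, 2), (10, 3)]
hd = hausdorff(gt, pred)     # 10.0 — dominated by the single outlier
ahd = average_hd(gt, pred)   # 2.5 — the outlier is diluted
```

A single stray boundary point drives the HD to 10 even though the contours otherwise coincide, whereas the averaged form remains moderate, illustrating why the average HD may be better suited for most applications.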
This is a first proposal to standardise reporting in videomics studies. Some technical concepts should be taken into consideration to improve collaboration and reliably assess the performance of novel DL algorithms before introducing them into a clinical setting.
Conflict of interest statement
The authors declare no conflict of interest.
Author contributions
AP, FPV, and SM conceived and designed the study; AS and CM discussed the contents and commented on the manuscript.
Ethical consideration
Not applicable.
References
- Paderno A, Gennarini F, Sordi A. Artificial intelligence in clinical endoscopy: insights in the field of videomics. Front Surg. 2022; 9:933297. DOI
- Paderno A, Holsinger FC, Piazza C. Videomics: bringing deep learning to diagnostic endoscopy. Curr Opin Otolaryngol Head Neck Surg. 2021; 29:143-148. DOI
- Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015; 15:29. DOI
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright
© Società Italiana di Otorinolaringoiatria e chirurgia cervico facciale, 2023