INTRODUCTION
Deep learning classifiers usually output a single class label for each case (e.g., "class A"). In clinical practice, however, a set of plausible classes with a known confidence level can be more valuable for decision-making and uncertainty estimation. Conformal prediction (CP) [1] is a statistical framework that augments a deep learning model by returning a set of possible classes while guaranteeing that the correct class is included in this set with a user-chosen probability (the coverage). At a given coverage level, smaller prediction sets indicate higher model certainty. We apply this approach to central spinal stenosis classification on sagittal lumbar MRIs, where quantifying the uncertainty among four classes (no stenosis, mild, moderate, severe) may help identify the most difficult cases.
METHODS
We applied the SpineNet model [2] to an MRI dataset of 1689 patients from our clinic (mean age 72 ± 8.8 years, 58% female) to compute the four-class stenosis classification (16% no stenosis, 46% mild, 18% moderate, 20% severe). Because CP operates on class probabilities, the raw softmax output of SpineNet was extracted for each case. The dataset was divided into calibration (50%) and test (50%) sets. The calibration set was used to determine the thresholds for constructing prediction sets, and the test set was used for the final evaluation. Four conformal prediction methods were evaluated: naive conformal prediction, adaptive prediction sets (APS), top-k conformal prediction, and class-conditional conformal prediction with class-specific thresholds. Each method was calibrated to provide prediction sets with a target coverage of 80%. Empirical coverage and prediction set size on the test set were used to evaluate each method.
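The calibration and evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the conformity score (1 minus the softmax probability of the true class) is a common choice in the CP literature, the softmax outputs are synthetic stand-ins for SpineNet outputs, and all function names are hypothetical. Only the naive and class-conditional variants are shown; APS and top-k are omitted.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def calibrate_naive(probs, labels, alpha):
    """One global threshold on the score 1 - p(true class)."""
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    return conformal_quantile(scores, alpha)

def calibrate_class_conditional(probs, labels, alpha):
    """One threshold per class, each from that class's calibration cases only."""
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    return np.array([conformal_quantile(scores[labels == k], alpha)
                     for k in range(probs.shape[1])])

def predict_sets(probs, qhat):
    """Boolean matrix: entry (i, k) is True if class k is in the set of case i.
    qhat is a scalar (naive) or a per-class vector (class-conditional)."""
    return (1.0 - probs) <= qhat

def coverage_and_size(sets, labels):
    """Empirical coverage and mean prediction set size."""
    covered = sets[np.arange(len(labels)), labels]
    return covered.mean(), sets.sum(axis=1).mean()

# Synthetic softmax outputs (assumption): class frequencies follow the abstract,
# and the per-class signal strength is invented to mimic unequal difficulty.
rng = np.random.default_rng(0)
n, n_classes, alpha = 2000, 4, 0.20          # alpha = 1 - target coverage (80%)
labels = rng.choice(n_classes, size=n, p=[0.16, 0.46, 0.18, 0.20])
signal = np.array([3.0, 1.5, 1.0, 2.0])
logits = rng.normal(size=(n, n_classes))
logits[np.arange(n), labels] += signal[labels]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# 50% calibration, 50% test, as in the study design.
cal, test = slice(0, n // 2), slice(n // 2, n)
q_naive = calibrate_naive(probs[cal], labels[cal], alpha)
q_cc = calibrate_class_conditional(probs[cal], labels[cal], alpha)

names = ["no stenosis", "mild", "moderate", "severe"]
for method, qhat in [("naive", q_naive), ("class-conditional", q_cc)]:
    sets = predict_sets(probs[test], qhat)
    for k in range(n_classes):
        mask = labels[test] == k
        cov, size = coverage_and_size(sets[mask], labels[test][mask])
        print(f"{method:18s} {names[k]:12s} coverage={cov:.2f} mean size={size:.2f}")
```

In this sketch the naive threshold targets 80% coverage only on average over all cases, so per-class coverage can drift above or below the target, whereas the class-specific thresholds target 80% within each class, typically at the cost of larger sets for the harder classes.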
RESULTS
The naive method achieved perfect coverage (100%) for no stenosis cases with an average prediction set size of 1.5, but showed lower coverage for mild (67%) and moderate (62%) stenosis (Figure 1). The APS method improved coverage for mild stenosis (77%) at the cost of slightly larger prediction sets (average size 2.36). The top-k method maintained the highest overall coverage (72-99%) across all classes but consistently required larger prediction sets (size 3.0). The class-conditional approach achieved the most balanced performance, maintaining approximately 80% coverage across all classes with the smallest average prediction set sizes (1.43-2.4).
DISCUSSION
The class-conditional method demonstrated the best balance between prediction set size and coverage guarantee, making it particularly suitable for clinical applications. The size of the prediction sets provides valuable insight into model uncertainty: larger sets indicate higher diagnostic uncertainty, while smaller sets reflect more confident predictions. This uncertainty quantification is particularly informative in borderline cases between stenosis severity levels. The flexibility in setting the coverage level (currently 80%) allows clinicians to adjust the trade-off between the certainty of including the correct class and the specificity of the prediction set, based on clinical requirements. Future work should focus on reducing prediction set sizes while maintaining coverage guarantees, particularly for moderate and severe stenosis cases, where accuracy is crucial. Furthermore, while our approach was effective without model retraining, future studies with larger datasets could explore the benefits of training, calibrating, and testing an end-to-end model.