Oral Presentation 51st International Society for the Study of the Lumbar Spine Annual Meeting 2025

Uncertainty Quantification in Lumbar Stenosis Classification on MRI: A Comparative Analysis of Conformal Prediction Methods (115532)

Andrea Cina 1,2, Catherine Jutzeler 1,3, Jacopo Vitale 2, Daniel Haschtmann 4, Markus Loibl 4, Tamas Fekete 4, Frank Kleinstük 4, Fabio Galbusera 2
  1. Department of Health Sciences and Technology, ETH Zürich, Zürich, Switzerland
  2. Department of Teaching, Research and Development, Schulthess Klinik, Zürich, Switzerland
  3. Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
  4. Department of Spine Surgery and Neurosurgery, Schulthess Klinik, Zürich, Switzerland

INTRODUCTION

Deep learning classifiers generally output a single class label for each case (e.g., "class A"). In clinical practice, however, a set of plausible classes with a known confidence level can be more valuable for decision-making and uncertainty estimation. Conformal Prediction (CP) [1] is a statistical framework that wraps a trained deep learning model and returns a set of possible classes, guaranteeing that the correct class is included in this set with a probability chosen in advance by the user (the coverage). At a given coverage, smaller prediction sets indicate higher model certainty. We apply this approach to central spinal stenosis classification on sagittal lumbar MRIs, where quantifying the uncertainty among four classes (no stenosis, mild, moderate, severe) may help identify the most difficult cases.
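The coverage guarantee described above can be written compactly. The notation below is an assumption introduced here for illustration and does not appear in the abstract: $\mathcal{C}(X)$ is the prediction set for an image $X$, $Y$ is its true class, and $\alpha$ is the user-chosen error rate (so $\alpha = 0.20$ for 80% coverage). In the standard split-conformal setting of [1], the guarantee is marginal (averaged over cases) and requires the calibration and test cases to be exchangeable:

```latex
\mathbb{P}\bigl( Y_{\mathrm{test}} \in \mathcal{C}(X_{\mathrm{test}}) \bigr) \;\ge\; 1 - \alpha
```

Because the guarantee is marginal, coverage within an individual class can fall below $1 - \alpha$ unless the method is calibrated per class.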

METHODS

We applied the SpineNet model [2] to an MRI dataset of 1689 patients from our clinic (mean age 72 ± 8.8 years, 58% female) to compute the four-class stenosis classification (16% no stenosis, 46% mild, 18% moderate, 20% severe). To apply CP, we extracted the raw softmax output of SpineNet. The dataset was divided into calibration (50%) and test (50%) sets. The calibration set was used to learn the thresholds for constructing prediction sets, and the resulting prediction sets were evaluated on the test set. Four conformal prediction methods were evaluated: naive conformal prediction, adaptive prediction sets (APS), top-k conformal prediction, and class-conditional conformal prediction with class-specific thresholds. Each method was calibrated to a target coverage of 80%. Empirical coverage and prediction set size were used to evaluate each method.
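The calibration step can be illustrated with a minimal sketch of split conformal prediction on softmax outputs. This is not the authors' code. It assumes the simple conformity score 1 minus the softmax probability of the true class, which may differ from what the abstract calls the "naive" method, and all function names are hypothetical. The class-conditional variant applies the same procedure separately to the calibration cases of each true class.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration cases for this alpha: the set must contain every class.
        return np.inf
    return np.sort(scores)[k - 1]

def calibrate_single_threshold(probs, labels, alpha):
    """One threshold for all classes (assumed score: 1 - softmax of the true class)."""
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    return conformal_quantile(scores, alpha)

def calibrate_class_conditional(probs, labels, alpha):
    """One threshold per class, each computed from that class's calibration cases."""
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    n_classes = probs.shape[1]
    return np.array([conformal_quantile(scores[labels == k], alpha)
                     for k in range(n_classes)])

def prediction_sets(probs, qhat):
    """Boolean (n_cases, n_classes) array; True marks a class included in the set.

    qhat may be a scalar or a per-class array (broadcast over columns).
    """
    return (1.0 - probs) <= qhat
```

With softmax arrays `cal_probs` and `test_probs` and integer labels `cal_labels` (hypothetical names), a target coverage of 80% corresponds to `qhat = calibrate_class_conditional(cal_probs, cal_labels, alpha=0.2)` followed by `prediction_sets(test_probs, qhat)`.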

RESULTS

The naive method achieved perfect coverage (100%) for no stenosis cases with an average prediction set size of 1.5, but showed lower coverage for mild (67%) and moderate (62%) stenosis (Figure 1). The APS method improved coverage for mild stenosis (77%) at the cost of slightly larger prediction sets (average size 2.36). The top-k method maintained the highest overall coverage (72-99%) across all classes but consistently required larger prediction sets (size 3.0). The class-conditional approach achieved the most balanced performance, maintaining approximately 80% coverage across all classes with the smallest average prediction set sizes (1.43-2.4).
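The two metrics reported above, per-class coverage and average prediction set size, can be computed as in the following sketch. It assumes prediction sets stored as a boolean array with one row per test case and one column per class, and assumes that both metrics are grouped by the true class. Neither detail is confirmed by the abstract, and the function name is hypothetical.

```python
import numpy as np

def evaluate_sets(sets, labels):
    """Per-class empirical coverage and mean set size.

    sets   : boolean array (n_cases, n_classes), True = class is in the set
    labels : integer array (n_cases,) of true classes
    """
    sets = np.asarray(sets, dtype=bool)
    labels = np.asarray(labels)
    covered = sets[np.arange(len(labels)), labels]  # was the true class included?
    sizes = sets.sum(axis=1)                        # number of classes per set
    report = {}
    for k in np.unique(labels):
        mask = labels == k
        report[int(k)] = {
            "coverage": float(covered[mask].mean()),
            "mean_set_size": float(sizes[mask].mean()),
        }
    return report
```

A class whose coverage falls well below the target while its sets stay small is one where a single shared threshold is too strict, which is the situation class-specific thresholds are meant to address.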

DISCUSSION

The class-conditional method demonstrated the best balance between prediction set size and coverage, making it particularly suitable for clinical applications. The size of the prediction sets provides direct insight into model uncertainty: larger sets indicate higher diagnostic uncertainty, while smaller ones reflect more confident predictions. This uncertainty quantification is particularly informative in borderline cases between stenosis severity levels. The coverage level (80% in this study) is adjustable, allowing clinicians to trade off the certainty that the correct class is included against the specificity of the prediction set, based on clinical requirements. Future work should focus on reducing prediction set sizes while maintaining coverage guarantees, particularly for moderate and severe stenosis cases, where accuracy is crucial. Furthermore, while our approach was effective without model retraining, future studies with larger datasets could explore the benefits of training, calibrating, and testing an end-to-end model.

[Figure 1]

  1. Angelopoulos AN, Bates S (2021) A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv [cs.LG]
  2. Windsor R, Jamaludin A, Kadir T, Zisserman A (2022) SpineNetV2: Automated detection, labelling and radiological grading of clinical MR scans. arXiv [eess.IV]