Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

Abstract

We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20\% on average in the 1\% and 10\% data settings. Our method achieves $\sim$77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.
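
For reference, a throughput of $\sim$77 FPS corresponds to a budget of roughly 13 ms per frame (1000 / 77). The following is a minimal sketch of how such a frames-per-second figure could be measured; it is not the benchmarking code used in this work. The names `measure_fps`, `infer`, `warmup`, and `clock` are hypothetical and introduced only for illustration.

```python
import time
from typing import Any, Callable, Iterable


def measure_fps(
    infer: Callable[[Any], Any],
    frames: Iterable[Any],
    warmup: int = 0,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return throughput in frames per second for `infer` over `frames`.

    `infer` is a hypothetical stand-in for one forward pass of a
    segmentation model. The first `warmup` frames are run but not timed,
    so one-off start-up costs do not distort the result.
    """
    frames = list(frames)
    if warmup < 0 or warmup >= len(frames):
        raise ValueError("warmup must be in [0, number of frames)")

    # Untimed warm-up passes.
    for frame in frames[:warmup]:
        infer(frame)

    # Timed passes. On a GPU, `infer` should block until the result is
    # ready (assumption); otherwise asynchronous kernel launches would
    # make the elapsed time too short.
    timed = frames[warmup:]
    start = clock()
    for frame in timed:
        infer(frame)
    elapsed = clock() - start

    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(timed) / elapsed
```

Passing the clock as a parameter keeps the helper deterministic under test, while the default `time.perf_counter` is suitable for wall-clock measurements.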