Abstract
We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
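
To make the contrast between the two objectives concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes frame-level binary speaker-activity targets scored with binary cross-entropy, and it assumes that the sorted objective orders target speakers by the frame of their first speech activity; the abstract does not state the sorting criterion, so that choice, along with every function name here (`bce`, `permutation_invariant_loss`, `arrival_order`, `sort_loss`), is hypothetical.

```python
import itertools
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two frames-by-speakers matrices."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def permute_columns(matrix, perm):
    """Reorder the speaker columns of a matrix according to perm."""
    return [[row[k] for k in perm] for row in matrix]


def permutation_invariant_loss(pred, target):
    """PIL: score every speaker permutation of the target, keep the lowest."""
    num_speakers = len(target[0])
    return min(
        bce(pred, permute_columns(target, perm))
        for perm in itertools.permutations(range(num_speakers))
    )


def arrival_order(target):
    """Speaker indices sorted by first active frame (assumed criterion)."""
    num_frames, num_speakers = len(target), len(target[0])

    def onset(k):
        for t in range(num_frames):
            if target[t][k] == 1:
                return t
        return num_frames  # speakers who never talk sort last

    return sorted(range(num_speakers), key=onset)


def sort_loss(pred, target):
    """Sorted objective: one fixed target ordering, no permutation search."""
    return bce(pred, permute_columns(target, arrival_order(target)))


# Toy example: 4 frames, 2 speakers. Column 1 speaks first, column 0 second.
target = [[0, 1], [0, 1], [1, 0], [1, 0]]
# Predictions that already follow arrival order (first speaker in column 0).
pred = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

order = arrival_order(target)
pil_value = permutation_invariant_loss(pred, target)
sort_value = sort_loss(pred, target)
unsorted_value = bce(pred, target)
```

In this toy case the sorted target coincides with the best permutation, so the two losses agree, while scoring against the unsorted target is much worse. The practical difference is that PIL searches all K! orderings for K speakers, whereas the sorted objective commits to a single ordering that the model must learn to produce on its own.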