Abstract
End-to-end audio-conditioned latent diffusion models (LDMs) have been widely adopted for audio-driven portrait animation, demonstrating their effectiveness in generating lifelike and high-resolution talking videos. However, direct application of audio-conditioned LDMs to lip-synchronization (lip-sync) tasks results in suboptimal lip-sync accuracy. Through an in-depth analysis, we identified the underlying cause as the "shortcut learning problem", wherein the model predominantly learns visual-visual shortcuts while neglecting the critical audio-visual correlations. To address this issue, we explored different approaches for integrating SyncNet supervision into audio-conditioned LDMs to explicitly enforce the learning of audio-visual correlations. Since the performance of SyncNet directly influences the lip-sync accuracy of the supervised model, the training of a well-converged SyncNet becomes crucial. We conducted the first comprehensive empirical studies to identify key factors affecting SyncNet convergence. Based on our analysis, we introduce StableSyncNet, with an architecture designed for stable convergence. Our StableSyncNet achieved a significant improvement in accuracy, increasing from 91% to 94% on the HDTF test set. Additionally, we introduce a novel Temporal Representation Alignment (TREPA) mechanism to enhance temporal consistency in the generated videos. Experimental results show that our method surpasses state-of-the-art lip-sync approaches across various evaluation metrics on the HDTF and VoxCeleb2 datasets.