Abstract
The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.