The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding

Abstract

Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.