Visual Intention Grounding for Egocentric Assistants

Abstract

Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts -- inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training with normal descriptions and egocentric intentions with a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.