EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

  • 2024-12-19 18:57:13
  • Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan

Abstract

Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 37 downstream applications demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.