RESPONSE: Benchmarking the Ability of Language Models to Undertake Commonsense Reasoning in Crisis Situation

  • 2025-03-14 12:32:40
  • Aissatou Diallo, Antonis Bikakis, Luke Dickens, Anthony Hunter, Rob Miller

Abstract

An interesting class of commonsense reasoning problems arises when people are faced with natural disasters. To investigate this topic, we present RESPONSE, a human-curated dataset containing 1789 annotated instances featuring 6037 sets of questions designed to assess LLMs' commonsense reasoning in disaster situations across different time frames. The dataset includes problem descriptions, missing resources, time-sensitive solutions, and their justifications, with a subset validated by environmental engineers. Through both automatic metrics and human evaluation, we compare LLM-generated recommendations against human responses. Our findings show that even state-of-the-art models like GPT-4 achieve only 37% human-evaluated correctness for immediate response actions, highlighting significant room for improvement in LLMs' ability for commonsense reasoning in crises.