RESPONSE: Benchmarking the Ability of Language Models to Undertake Commonsense Reasoning in Crisis Situation

Abstract

An interesting class of commonsense reasoning problems arises when people are faced with natural disasters. To investigate this topic, we present \textsf{RESPONSE}, a human-curated dataset containing 1789 annotated instances featuring 6037 sets of questions designed to assess LLMs' commonsense reasoning in disaster situations across different time frames. The dataset includes problem descriptions, missing resources, time-sensitive solutions, and their justifications, with a subset validated by environmental engineers. Through both automatic metrics and human evaluation, we compare LLM-generated recommendations against human responses. Our findings show that even state-of-the-art models like GPT-4 achieve only 37\% human-evaluated correctness for immediate response actions, highlighting significant room for improvement in LLMs' ability for commonsense reasoning in crises.