ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning

Abstract

The ACPBench dataset provides atomic reasoning tasks required for efficient planning. The dataset is aimed at distilling the complex plan generation task into separate atomic reasoning tasks in their easiest possible form, boolean or multiple-choice questions, where the model has to choose the right answer from the provided options. While the aim of ACPBench is to test the simplest form of reasoning about action and change, when tasked with planning, a model does not typically have options to choose from, and thus the reasoning required for planning dictates an open-ended, generative form for these tasks. To that end, we introduce ACPBench Hard, a generative version of ACPBench, with open-ended questions which the model needs to answer. Models that perform well on these tasks could in principle be integrated into a planner or be used directly as a policy. We discuss the complexity of these tasks as well as the complexity of validating the correctness of their answers, and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most of these tasks the performance of even the largest models is still subpar. Our experiments show that no single model consistently outperforms the others on these tasks and, with a few exceptions, all tested language models score below 65%, indicating that even the current frontier language models have a long way to go before they can reliably reason about planning. In fact, even the so-called reasoning models struggle with solving these reasoning tasks. The ACPBench Hard collection is available at the following link: https://ibm.github.io/ACPBench