Abstract
We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing -- applications that should grow in importance as such agents are further integrated into the economy.