DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

Abstract

Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.