Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Abstract

Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM's operational semantics of such subjective phrases -- how it adjusts its behavior when each phrase is included in the prompt -- thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.
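
The comparison step described above can be illustrated with a minimal sketch. It assumes that each thesaurus is available as a mapping from unordered phrase pairs to a boolean "similar" judgment. The function name `find_disagreements`, the dictionary representation, and the toy entries are hypothetical and chosen only for illustration; the sketch does not show how the LLM thesaurus itself is constructed, and the toy values are not results from this work.

```python
from itertools import combinations


def find_disagreements(phrases, llm_similar, human_similar):
    """Return the phrase pairs on which two thesauruses disagree.

    Both thesauruses are assumed (hypothetically) to be dicts that map a
    frozenset of two phrases to True when the phrases are judged similar.
    Pairs missing from a dict are treated as "not similar".
    """
    llm_only = []    # similar according to the LLM, not according to humans
    human_only = []  # similar according to humans, not according to the LLM
    for a, b in combinations(sorted(phrases), 2):
        pair = frozenset((a, b))
        in_llm = llm_similar.get(pair, False)
        in_human = human_similar.get(pair, False)
        if in_llm and not in_human:
            llm_only.append((a, b))
        elif in_human and not in_llm:
            human_only.append((a, b))
    return llm_only, human_only


# Toy judgments for illustration only (not measurements).
phrases = ["witty", "harassing", "humorous"]
llm_similar = {
    frozenset(("witty", "harassing")): True,
    frozenset(("witty", "humorous")): False,
}
human_similar = {
    frozenset(("witty", "harassing")): False,
    frozenset(("witty", "humorous")): True,
}

llm_only, human_only = find_disagreements(phrases, llm_similar, human_similar)
print("LLM-only similarities:", llm_only)
print("Human-only similarities:", human_only)
```

Each pair in either list is a candidate failure: the model treats two phrases as related when people do not, or the reverse. Such a pair would still need to be checked against actual model outputs before being counted as misalignment.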