Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More

Abstract

This work concerns the path-star task, a minimal example of searching over agraph. The graph, $G$, is star-shaped with $D$ arms radiating from a startnode, $s$. A language model (LM) is given $G$, $s$, and a target node $t$,which ends one of the arms and is tasked with generating the arm containing$t$. The minimal nature of this task means only a single choice needs to bemade: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due toa learned shortcut that absorbs training supervision. We show how thispathology is caused by excess supervision and we present a series of solutionsdemonstrating that the task is solvable via decoder-only LMs. We find that thetask's minimal nature causes its difficulty, as it prevents task decomposition.Our solutions provide insight into the pathology and its implications for LMstrained via next-token prediction.