Gauging Overprecision in LLMs: An Empirical Study

Abstract

Recently, overconfidence in large language models (LLMs) has garnered considerable attention due to its fundamental importance in quantifying the trustworthiness of LLM generation. However, existing approaches prompt \textit{black box LLMs} to produce their own confidence (\textit{verbalized confidence}), which can be subject to many biases and hallucinations. Inspired by a different aspect of overconfidence in cognitive science called \textit{overprecision}, we designed a framework for its study in black box LLMs. This framework contains three main phases: 1) generation, 2) refinement, and 3) evaluation. In the generation phase, we prompt the LLM to answer numerical questions in the form of intervals with a certain level of confidence. This confidence level is imposed in the prompt rather than elicited from the LLM, as in previous approaches. We use various prompting techniques and repeat the same prompt multiple times to gauge the effects of randomness in the generation process. In the refinement phase, answers from the previous phase are refined to generate better answers. In the evaluation phase, the LLM answers are evaluated and studied to understand the model's internal workings. This study allowed us to gain various insights into LLM overprecision: 1) LLMs are highly uncalibrated for numerical tasks; 2) there is no correlation between the length of the interval and the imposed confidence level, which can be symptomatic of a) a lack of understanding of the concept of confidence or b) an inability to adjust self-confidence by following instructions; 3) LLM numerical precision differs depending on the task, the scale of the answer, and the prompting technique; 4) refinement of answers does not improve precision in most cases. We believe this study offers new perspectives on LLM overconfidence and serves as a strong baseline for overprecision in LLMs.
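
To make the notion of calibration used above concrete, the following is a minimal sketch of one way the evaluation phase could compare the imposed confidence level with the empirical hit rate, that is, the fraction of questions whose true answer falls inside the interval the LLM returned. The function names (`hit_rate`, `overprecision_gap`) and the sample data are hypothetical illustrations and are not taken from the study. A well-calibrated model would show a gap near zero, while a positive gap would indicate intervals that are too narrow for the stated confidence.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def hit_rate(intervals: Sequence[Interval], truths: Sequence[float]) -> float:
    """Fraction of true answers that fall inside their predicted interval."""
    if not intervals or len(intervals) != len(truths):
        raise ValueError("intervals and truths must be non-empty and equal in length")
    hits = sum(1 for (low, high), truth in zip(intervals, truths) if low <= truth <= high)
    return hits / len(intervals)


def overprecision_gap(confidence: float,
                      intervals: Sequence[Interval],
                      truths: Sequence[float]) -> float:
    """Imposed confidence minus empirical hit rate (positive means overprecise)."""
    return confidence - hit_rate(intervals, truths)


# Hypothetical data: four interval answers requested at an imposed 95% confidence.
sample_intervals = [(90.0, 110.0), (0.0, 5.0), (1000.0, 2000.0), (3.0, 4.0)]
sample_truths = [100.0, 7.0, 1500.0, 3.5]

print(hit_rate(sample_intervals, sample_truths))
print(overprecision_gap(0.95, sample_intervals, sample_truths))
```

In this made-up sample, three of the four true values lie inside their intervals, so the hit rate is 0.75 and the gap relative to the imposed 0.95 level is about 0.20.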