Reasoning-Grounded Natural Language Explanations for Language Models

Abstract

We propose a large language model explainability technique for obtaining faithful natural language explanations by grounding the explanations in a reasoning process. When converted to a sequence of tokens, the outputs of the reasoning process can become part of the model context and later be decoded to natural language as the model produces either the final answer or the explanation. To improve the faithfulness of the explanations, we propose to use a joint predict-explain approach, in which the answers and explanations are inferred directly from the reasoning sequence, without the explanations being dependent on the answers and vice versa. We demonstrate the plausibility of the proposed technique by achieving a high alignment between answers and explanations in several problem domains, observing that language models often simply copy the partial decisions from the reasoning sequence into the final answers or explanations. Furthermore, we show that the proposed use of reasoning can also improve the quality of the answers.
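
The joint predict-explain idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`joint_predict_explain`, `toy_reason`, `toy_decode`), the prompt layout, and the toy parity task are all assumptions made for illustration. The only point it shows is that the answer and the explanation are each decoded from the same context (question plus reasoning sequence), and neither is conditioned on the other.

```python
from typing import Callable, Tuple


def joint_predict_explain(
    question: str,
    reason: Callable[[str], str],
    decode: Callable[[str], str],
) -> Tuple[str, str, str]:
    """Decode answer and explanation independently from one reasoning sequence.

    `reason` and `decode` are hypothetical stand-ins for calls to a language
    model; they are not part of any real library.
    """
    # The reasoning sequence is produced once and placed in the context.
    reasoning = reason(question)
    shared = f"Question: {question}\nReasoning: {reasoning}\n"
    # Each output sees only the shared context, never the other output.
    answer = decode(shared + "Answer:")
    explanation = decode(shared + "Explanation:")
    return reasoning, answer, explanation


def toy_reason(question: str) -> str:
    # Toy "reasoning": extract the integer in the question and record its parity.
    n = int("".join(ch for ch in question if ch.isdigit()))
    return f"is_even={n % 2 == 0}"


def toy_decode(context: str) -> str:
    # Toy "decoder": copies the partial decision from the reasoning sequence.
    even = "is_even=True" in context
    if context.endswith("Answer:"):
        return "yes" if even else "no"
    return "The number is even." if even else "The number is odd."


if __name__ == "__main__":
    print(joint_predict_explain("Is 4 even?", toy_reason, toy_decode))
```

Because both outputs are read off the same partial decision in this toy setup, they agree by construction, which mirrors the copying behaviour the abstract reports.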