Decoding Human Preferences in Alignment: An Improved Approach to Inverse Constitutional AI

Abstract

Traditional methods for aligning Large Language Models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on implicit principles, limiting interpretability. Constitutional AI (CAI) offers an explicit, rule-based framework for guiding LLM alignment. Building on this, we refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. By improving principle generation, clustering, and embedding processes, our approach enhances the accuracy and generalizability of extracted principles across synthetic and real-world datasets. Our results highlight the potential of these principles to foster more transparent and adaptable alignment methods, offering a promising direction for future advancements beyond traditional fine-tuning.