Differentially Private Steering for Large Language Model Alignment

Abstract

Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.