ReLearn: Unlearning via Learning for Large Language Models

Abstract

Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the prediction of subsequent tokens, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.