D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

Abstract

Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method, comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
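
To make the selection step more concrete, the following is a minimal sketch of how three per-sample scores might be combined in a greedy, coreset-style selection. It is not the D3 objective itself, which the abstract does not spell out. The function name `select_subset`, the threshold `dep_threshold`, the use of Euclidean distance between embeddings as the diversity signal, and the product `distance * difficulty` as the gain are all assumptions made for illustration.

```python
import math

def select_subset(embeddings, difficulty, dependability, budget, dep_threshold=0.5):
    """Greedy weighted k-center-style selection (illustrative assumption,
    not the paper's exact objective).

    embeddings:    list of equal-length numeric vectors, one per sample
    difficulty:    list of non-negative difficulty scores
    dependability: list of scores; samples below dep_threshold are excluded
    budget:        maximum number of samples to return
    """
    # Dependability acts as a filter: undependable samples are never picked.
    eligible = [i for i, d in enumerate(dependability) if d >= dep_threshold]
    if not eligible or budget <= 0:
        return []

    # Seed the subset with the most difficult eligible sample.
    first = max(eligible, key=lambda i: difficulty[i])
    selected = [first]
    remaining = [i for i in eligible if i != first]

    # Distance from each eligible sample to its nearest selected sample.
    min_dist = {i: math.dist(embeddings[i], embeddings[first]) for i in eligible}

    while remaining and len(selected) < budget:
        # Diversity (distance to the current subset) weighted by difficulty.
        nxt = max(remaining, key=lambda i: min_dist[i] * difficulty[i])
        selected.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[nxt]))
    return selected


# Toy data: two tight clusters plus one undependable outlier (index 4).
emb = [[0, 0], [0, 0.1], [10, 0], [10, 0.1], [5, 5]]
diff = [0.9, 0.8, 0.7, 0.6, 1.0]
dep = [1, 1, 1, 1, 0]
print(select_subset(emb, diff, dep, budget=2))  # [0, 2]
```

With a budget of two, the sketch picks one sample from each cluster and skips the outlier despite its high difficulty score, because it fails the dependability filter. An iterative variant in the spirit of the multi-round procedure described above would recompute `difficulty` after each round of tuning and call the function again.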