Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

  • 2024-12-09 22:23:41
  • Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson

Abstract

Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.