LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Abstract

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data presents significant challenges due to the necessity of navigating extremely large combinatorial hypothesis spaces. Current methods of equation discovery, commonly known as symbolic regression techniques, largely focus on extracting equations from data alone, often neglecting the domain-specific prior knowledge that scientists typically depend on. They also employ limited representations such as expression trees, constraining the search space and expressiveness of equations. To bridge this gap, we introduce LLM-SR, a novel approach that leverages the extensive scientific knowledge and robust code generation capabilities of Large Language Models (LLMs) to discover scientific equations from data. Specifically, LLM-SR treats equations as programs with mathematical operators and combines LLMs' scientific priors with evolutionary search over equation programs. The LLM iteratively proposes new equation skeleton hypotheses, drawing from its domain knowledge, which are then optimized against data to estimate parameters. We evaluate LLM-SR on four benchmark problems across diverse scientific domains (e.g., physics, biology), which we carefully designed to simulate the discovery process and prevent LLM recitation. Our results demonstrate that LLM-SR discovers physically accurate equations that significantly outperform state-of-the-art symbolic regression baselines, particularly in out-of-domain test settings. We also show that LLM-SR's incorporation of scientific priors enables more efficient equation space exploration than the baselines. Code and data are available: https://github.com/deep-symbolic-mathematics/LLM-SR
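To make the "equations as programs" idea concrete, the following is a minimal sketch and not the LLM-SR implementation. It assumes a toy oscillator dataset, two hard-coded skeleton functions standing in for LLM proposals, and a simple coordinate pattern search standing in for the parameter optimizer. The names `skeleton_undamped`, `skeleton_damped`, `fit_params`, and `mse` are hypothetical and do not come from the paper or its repository.

```python
# Illustrative sketch only: skeletons are plain Python functions with free
# parameters; each one is fitted to data and scored, and the best is kept.

def skeleton_undamped(x, v, params):
    """Hypothetical proposal 1: linear restoring force only."""
    (k,) = params
    return -k * x

def skeleton_damped(x, v, params):
    """Hypothetical proposal 2: restoring force plus linear damping."""
    k, c = params
    return -k * x - c * v

def mse(skel, params, data):
    """Mean squared error of a skeleton's predictions on (x, v, a) triples."""
    return sum((skel(x, v, params) - a) ** 2 for x, v, a in data) / len(data)

def fit_params(skel, data, n_params, step=1.0, tol=1e-8, max_iter=10000):
    """Coordinate pattern search (an assumed stand-in optimizer):
    try +/- step on each parameter, halve the step when nothing improves."""
    params = [1.0] * n_params
    best = mse(skel, params, data)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(n_params):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                score = mse(skel, trial, data)
                if score < best:
                    params, best, improved = trial, score, True
        if not improved:
            step *= 0.5
    return params, best

# Synthetic data from an assumed ground truth: a = -4.0 * x - 0.5 * v
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [(x, v, -4.0 * x - 0.5 * v) for x in grid for v in grid]

# Stand-in for one round of LLM proposals: (name, skeleton, parameter count)
candidates = [
    ("undamped", skeleton_undamped, 1),
    ("damped", skeleton_damped, 2),
]

# Fit every candidate and keep the one with the lowest error
results = {name: fit_params(skel, data, n) for name, skel, n in candidates}
best_name = min(results, key=lambda name: results[name][1])
best_params, best_score = results[best_name]
print(best_name, best_params, best_score)
```

In the approach the abstract describes, the candidate list would come from an LLM prompted with the problem description and previously scored programs, and the scores would feed back into the evolutionary search; here both are replaced by fixed placeholders to keep the sketch self-contained.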