Abstract
Large language models (LLMs) are being used in economics research to form predictions, label text, simulate human responses, generate hypotheses, and even produce data for times and places where such data don't exist. While these uses are creative, are they valid? When can we abstract away from the inner workings of an LLM and simply rely on its outputs? We develop an econometric framework to answer this question. Our framework distinguishes between two types of empirical tasks. Using LLM outputs for prediction problems (including hypothesis generation) is valid under one condition: no "leakage" between the LLM's training dataset and the researcher's sample. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed by some text or from human subjects) requires an additional assumption: LLM outputs must be as good as the gold standard measurements they replace. Otherwise estimates can be biased, even if LLM outputs are highly accurate but not perfectly so. We document the extent to which these conditions are violated and the implications for research findings in illustrative applications to finance and political economy. We also provide guidance to empirical researchers. The only way to ensure no training leakage is to use open-source LLMs with documented training data and published weights. The only way to deal with LLM measurement error is to collect validation data and model the error structure. A corollary is that if such conditions can't be met for a candidate LLM application, our strong advice is: don't.