Abstract
We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.
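The three-stage planning loop described above can be illustrated with a minimal sketch. This is not the AutoDCWorkflow implementation: each LLM-driven component is replaced here by a simple deterministic heuristic so that the control flow is runnable, and every name in it (`select_target_columns`, `inspect_column_quality`, `next_operation`, `apply_operation`, `auto_dc_workflow`, the operation labels, and the `"unknown"` fill value) is a hypothetical placeholder chosen for illustration.

```python
import re
from typing import Optional

Table = list[dict[str, str]]

def select_target_columns(table: Table, purpose: str) -> list[str]:
    """Stage 1 stand-in: keep columns whose name occurs as a word in the purpose."""
    words = set(re.findall(r"\w+", purpose.lower()))
    columns = list(table[0].keys()) if table else []
    return [c for c in columns if c.lower() in words]

def inspect_column_quality(table: Table, column: str) -> dict[str, bool]:
    """Stage 2 stand-in: a tiny 'data quality report' for one column."""
    values = [row[column] for row in table]
    return {
        "missing": any(v.strip() == "" for v in values),
        "inconsistent_format": any(
            v != v.strip().lower() for v in values if v.strip()
        ),
    }

def next_operation(report: dict[str, bool]) -> Optional[str]:
    """Stage 3 stand-in: choose the next operation from the report."""
    if report["inconsistent_format"]:
        return "normalize_format"
    if report["missing"]:
        return "fill_missing"
    return None  # column already meets the (assumed) quality objectives

def apply_operation(table: Table, column: str, op: str) -> Table:
    """Apply one hypothetical cleaning operation to a single column."""
    if op == "normalize_format":
        return [{**r, column: r[column].strip().lower()} for r in table]
    if op == "fill_missing":
        return [
            {**r, column: r[column] if r[column].strip() else "unknown"}
            for r in table
        ]
    raise ValueError(f"unsupported operation: {op}")

def auto_dc_workflow(table: Table, purpose: str, max_steps: int = 10):
    """Return (minimal clean table, workflow) for the given purpose."""
    targets = select_target_columns(table, purpose)
    workflow: list[tuple[str, str]] = []
    for column in targets:
        for _ in range(max_steps):  # bound the inspect/operate loop
            op = next_operation(inspect_column_quality(table, column))
            if op is None:
                break
            table = apply_operation(table, column, op)
            workflow.append((op, column))
    # Project onto the target columns, then drop duplicate rows.
    projected = [{c: r[c] for c in targets} for r in table]
    seen, clean = set(), []
    for row in projected:
        key = tuple(row[c] for c in targets)
        if key not in seen:
            seen.add(key)
            clean.append(row)
    if len(clean) < len(projected):
        workflow.append(("drop_duplicates", ",".join(targets)))
    return clean, workflow

dirty = [
    {"id": "1", "city": " Chicago ", "price": "10"},
    {"id": "2", "city": "chicago", "price": "10"},
    {"id": "3", "city": "", "price": "12"},
]
clean, workflow = auto_dc_workflow(dirty, "What is the average price per city?")
```

In this toy run, the purpose mentions only `city` and `price`, so the `id` column is dropped from the output, and the recorded workflow lists the operations in the order they were applied. In the pipeline described above, each of the three stand-in functions would instead be a prompt to an LLM.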