A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data

  • 2024-12-18 16:07:04
  • Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

Abstract

Tabular data is prevalent in real-world machine learning applications, and new models for supervised learning of tabular data are frequently proposed. Comparative studies assessing the performance of models typically consist of model-centric evaluation setups with overly standardized data preprocessing. This paper demonstrates that such model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering. Therefore, we propose a data-centric evaluation framework. We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset. We conduct experiments with different preprocessing pipelines and hyperparameter optimization (HPO) regimes to quantify the impact of model selection, HPO, feature engineering, and test-time adaptation. Our main findings are: 1. After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces. 2. Recent models, despite their measurable progress, still significantly benefit from manual feature engineering. This holds true for both tree-based models and neural networks. 3. While tabular data is typically considered static, samples are often collected over time, and adapting to distribution shifts can be important even in supposedly static data. These insights suggest that research efforts should be directed toward a data-centric perspective, acknowledging that tabular data requires feature engineering and often exhibits temporal characteristics. Our framework is available under: https://github.com/atschalz/dc_tabeval.