Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs

Abstract

Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLMs' objective "accuracy" rather than their subjective "believability" in the web action generation task, leveraging a large-scale, real-world dataset of human online shopping actions. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.