Abstract
Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.