A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

Abstract

This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.