Abstract: Evaluating Large Language Models for Mental Health (Society for Social Work and Research 30th Anniversary Annual Conference)

Evaluating Large Language Models for Mental Health

Schedule:
Friday, January 16, 2026
Liberty BR I, ML 4 (Marriott Marquis Washington DC)
* noted as presenting author
Gaurav Sinha, PhD, Assistant Professor, University of Georgia, GA
Ugur Kursuncu, PhD, Assistant Professor, Georgia State University, GA
Christopher Larrison, PhD, Associate Professor, University of Illinois at Urbana-Champaign, Urbana, IL
Background and Objective. About two-thirds of U.S. adults with a diagnosed mental health condition are unable to access treatment, despite having health insurance (Davenport et al., 2023). The challenge is compounded by shortages in mental health services, with available resources stretched thin and long wait times for receiving services (Kazdin, 2017). As the demand for mental health services grows, technology offers innovative solutions to bridge these gaps. Conversational agents powered by advanced large language models (LLMs) have shown potential for providing real-time mental health support because they can comprehend and generate human-like conversation (Stade et al., 2024). However, their use in mental health, especially within clinical contexts, has sparked debate, often centering on the effectiveness, ethical implications, and practicality of integrating LLMs into mental health support (Guo et al., 2024). The primary objective of this study is to demonstrate the evaluation of LLMs through specific performance indicators, highlighting the criteria and metrics used to assess their effectiveness in this context.

Methods. We fine-tuned two LLMs (ChatGPT-3.5 and LLaMA7B) on 12 years of data from r/Anxiety, comprising top-level prompt-response pairs that reflect supportive interactions among users. Fine-tuning involves further training a pre-trained model (such as ChatGPT) and adjusting its parameters (such as temperature and max tokens) to enhance the model's relevance and improve its performance on a specific task, in our case responding to anxiety as a peer. We then compared the performance of the fine-tuned ChatGPT-3.5 and LLaMA7B against non-fine-tuned baseline models on three quantitative benchmarks: linguistic quality, safety and trustworthiness, and supportiveness, each of which comprises a range of metrics. Finally, we conducted Levene's test and Welch's ANOVA for each metric to assess whether there were statistically significant differences among the models.
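
As an illustration only (not the study's actual code), the model-comparison step could look like the following sketch. The long-format data frame, the model labels, the example scores, and the use of SciPy and the pingouin package are all assumptions introduced for this example; Welch's ANOVA is paired with Levene's test here because it does not require equal group variances, which is what Levene's test probes.

```python
# Illustrative sketch: compare one metric across three model variants
# with Levene's test (variance homogeneity) and Welch's ANOVA (group means).
import pandas as pd
from scipy import stats
import pingouin as pg  # provides welch_anova

# Hypothetical long-format data: one row per generated response,
# with the model that produced it and its score on a single metric.
df = pd.DataFrame({
    "model": ["gpt35_ft"] * 3 + ["llama7b_ft"] * 3 + ["baseline"] * 3,
    "score": [0.72, 0.68, 0.75, 0.61, 0.65, 0.59, 0.55, 0.60, 0.52],
})

# Levene's test checks whether the score variances differ across models.
groups = [g["score"].values for _, g in df.groupby("model")]
levene_stat, levene_p = stats.levene(*groups)

# Welch's ANOVA tests for mean differences without assuming equal variances.
welch = pg.welch_anova(data=df, dv="score", between="model")

print(f"Levene: W={levene_stat:.3f}, p={levene_p:.3f}")
print(welch[["F", "p-unc"]])
```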

Findings. Readability metrics, such as Flesch-Kincaid and Gunning Fog, indicated that the models generally produced text accessible to a wide audience, though some responses were more complex. Semantic coherence, measured with BLEU, ROUGE, BLEURT, and BERT-based scores, was generally high across models, though minor discrepancies were noted. Safety and trustworthiness were evaluated using the GenBit score and toxicity scores, which indicated that some harmful language was present in the responses. Supportiveness, assessed through empathy metrics, showed that the levels of emotion expressed varied across the models.
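
For concreteness, a minimal sketch of how two of these metric families can be computed for a single generated response follows. The textstat and rouge_score libraries and the example sentences are assumptions for illustration, not the study's actual evaluation pipeline.

```python
# Illustrative sketch of per-response evaluation metrics (assumed libraries:
# textstat for readability, rouge_score for lexical overlap with a reference).
import textstat
from rouge_score import rouge_scorer

reference = "It helps to slow your breathing and remind yourself the feeling will pass."
candidate = "Try slowing your breathing; the anxious feeling usually passes after a few minutes."

# Readability: grade-level estimates, where lower values mean more accessible text.
fk_grade = textstat.flesch_kincaid_grade(candidate)
fog_index = textstat.gunning_fog(candidate)

# Lexical overlap with a reference response, one ingredient of semantic coherence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"Flesch-Kincaid grade: {fk_grade:.1f}, Gunning Fog: {fog_index:.1f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```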

Discussion and Implications. With the rapidly growing use of LLMs in therapeutic interventions, our study demonstrates that evaluating these models is critical for ensuring they provide a safe, supportive, and effective experience for users. Given the existing gaps in mental health services, LLMs have potential to provide immediate support in times of need; however, they should not be seen as a replacement for a real therapist. As such, these tools can serve as an adjunct to therapists, but continuous refinement is necessary to personalize responses, build trust in the system, and ensure user safety. More research is needed to ensure these emerging technologies can be integrated responsibly into mental healthcare and social work practice in ways that uphold the highest standards of care, ethics, and social responsibility.