Abstract: Benchmarking Local Large Language Models for Child Welfare Text Analysis (Society for Social Work and Research 30th Anniversary Annual Conference)

93P Benchmarking Local Large Language Models for Child Welfare Text Analysis

Schedule:
Thursday, January 15, 2026
Marquis BR 6, ML 2 (Marriott Marquis Washington, DC)
Zia Qi, MSW, Research Technology Specialist, University of Michigan-Ann Arbor, Ann Arbor, MI
Brian Perron, PhD, Professor, University of Michigan-Ann Arbor, Ann Arbor, MI
Joseph Ryan, PhD, Professor, University of Michigan-Ann Arbor, Ann Arbor, MI
Bryan Victor, PhD, Associate Professor, Wayne State University, Detroit, MI

Background/Purpose

Child welfare agencies maintain extensive collections of unstructured text data, including investigation summaries and case notes, that could meaningfully inform practice and policy. These resources remain underutilized because agencies have limited capacity to process large volumes of narrative text. While cloud-based large language models (LLMs) offer powerful analytical capabilities, they raise serious data security concerns when handling sensitive child welfare information. Locally deployed LLMs offer a promising alternative, but researchers must choose among hundreds of available models without domain-specific performance metrics to guide selection. This study addresses that gap by developing and applying child welfare-specific benchmarks to systematically evaluate locally deployable LLMs on tasks essential to the analysis of child welfare text data.
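
To illustrate what "locally deployed" means in practice, the minimal sketch below queries a model served entirely on the analyst's own machine, so case text never leaves local hardware. It is an illustration only, not part of the study: it assumes the open-source Ollama runtime listening on its default local port, and the model name, prompt, and case text are hypothetical placeholders.

```python
# Minimal sketch, assuming an Ollama server on localhost (default port 11434)
# and a locally pulled model; "llama3" is a placeholder, not a study model.
# Because the request targets localhost, the case narrative stays on the
# local machine; no text is sent to a cloud provider.
import requests

case_text = "Worker noted the family has moved three times in the past two months."

response = requests.post(
    "http://localhost:11434/api/generate",   # local endpoint only
    json={
        "model": "llama3",
        "prompt": (
            "Does this case note indicate housing instability? "
            f"Answer YES or NO.\n\n{case_text}"
        ),
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```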

Methods

Our research team developed comprehensive benchmarks across five critical domains in child welfare text analysis: identification of 1) general substance use issues, 2) opioid use specifically, 3) domestic violence indicators, 4) gun safety concerns, and 5) housing instability factors. For each domain, we created standardized classification tasks requiring models to analyze case text and identify the presence of domain-specific risk factors. We assessed each model's performance by comparing AI-assigned classifications against expert human coding, calculating reliability metrics to determine accuracy and consistency.
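
As a minimal sketch of the reliability check described above, assuming binary present/absent classifications and hypothetical example codes rather than the study's actual benchmark data, agreement between model-assigned and expert-assigned labels can be quantified with standard metrics such as percent agreement and Cohen's kappa:

```python
# Minimal sketch of comparing model classifications against expert human coding.
# The codes below are hypothetical placeholders (1 = risk factor present, 0 = absent).
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Expert human codes and model-assigned codes for the same set of case texts
human_codes = [1, 0, 1, 1, 0, 0, 1, 0]
model_codes = [1, 0, 1, 0, 0, 0, 1, 0]

print("Percent agreement:", accuracy_score(human_codes, model_codes))
print("Cohen's kappa:", cohen_kappa_score(human_codes, model_codes))
```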

Results

Our evaluation revealed distinct performance patterns across the ten local models tested. Several models achieved reliability scores above 0.80 against expert human coders, making it possible to optimize for speed and efficiency without sacrificing accuracy. Importantly, we found no consistent relationship between model size and effectiveness on child welfare-specific tasks: several smaller models outperformed larger alternatives on specific tasks, challenging the assumption that larger models are universally superior. Performance also varied substantially across the five benchmark domains, with models often excelling in some areas while underperforming in others. The housing instability domain proved especially challenging, requiring models to recognize indicators ranging from explicit mentions of homelessness to more subtle references to frequent moves or unstable living arrangements. All benchmark results will be updated immediately before the presentation so that participants receive the most current information on state-of-the-art model performance.

Conclusions and Implications

These findings demonstrate that carefully selected local LLMs can effectively analyze sensitive child welfare text data while maintaining security and privacy. The variability in performance across domains highlights the importance of task-specific benchmarking when selecting models for specialized applications. By identifying models that achieve high reliability with human coders while optimizing for computational efficiency, this research provides child welfare agencies with a practical pathway to securely leverage their existing text data. This approach can transform how agencies extract insights from unstructured data to improve decision-making, resource allocation, and ultimately, outcomes for vulnerable children and families.